Indian Journal of Science and Technology, Vol 10(8), DOI: 10.17485/ijst/2017/v10i8/108907, February 2017. ISSN (Print): 0974-6846; ISSN (Online): 0974-5645

Improving Triangle-Graph Based Text Summarization using Hybrid Similarity Function

Yazan Alaya AL-Khassawneh1*, Naomie Salim1 and Mutasem Jarrah1,2
1 Faculty of Computing, Universiti Teknologi Malaysia, 81310, Skudai, Johor, Malaysia; [email protected], [email protected]
2 Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia; [email protected]
* Author for correspondence

Abstract

Objective: Extractive summarization extracts the most relevant sentences from a document while preserving its most vital information. Graph-based techniques have become very popular for text summarization. This paper introduces a hybrid graph-based technique for single-document extractive summarization. Methods/Statistical Analysis: Prior research that utilised the graph-based approach for extractive summarization deployed a single function for computing sentence similarity. In this work, we propose an innovative hybrid similarity function (H) for this estimation. The function hybridises four distinct similarity measures: cosine similarity (sim1), Jaccard similarity (sim2), word alignment-based similarity (sim3) and window-based similarity (sim4). The method uses a trainable summarizer that takes several features into account, and the effect of these features on the summarization task is investigated. Findings: By combining the traditional similarity measures (cosine and Jaccard) with dynamic programming approaches (word alignment-based and window-based) for calculating the similarity between two sentences, more common information is captured, which helps to find the best sentences to extract into the final summary. The proposed method was evaluated using ROUGE measures on the DUC 2002 dataset. The experimental results showed that specific combinations of features give higher efficiency, and that some features have more effect than others on summary creation. Applications/Improvements: The performance of the new method was tested on the DUC 2002 dataset, its effectiveness was measured using the ROUGE score, and the results are promising when compared with some existing techniques.

Keywords: Extractive Summarization, Feature Extraction, Graph-Based Summarization, Hybrid Similarity, Sentence Similarity, Triangle Counting

1. Introduction

With the rapid expansion of the Internet, a vast quantity of knowledge is available and reachable online. People make extensive use of the Internet to find information through proficient Information Retrieval (IR) tools such as Google, Yahoo and AltaVista. They do not have enough time to read everything, yet they must make vital decisions based on the information available there. As the quantity and accessibility of textual information increases, the need for new procedures that help people find and absorb the vital information in these sources becomes increasingly imperative, and the need for computerized summaries becomes more visible. Improvement in text summarization and filtering will not only permit the development of superior retrieval systems, but will also support accessing and analyzing information in the text in multiple ways. There are diverse definitions of a summary.
In1, a summary is described as a text that is founded on one or more texts; it keeps the most vital information of the original documents, and its content is no more than half of the original documents. In2, text summarization is defined as the process of finding the most significant content, discovering the major sources of information, and presenting them as a brief text in a predefined format. In3, the following definition is provided: "A summary can be loosely defined as a text that is produced from one or more texts that conveys important information in the original text(s), and that is no longer than half of the original text(s) and usually significantly less than that".

Text summarization has three main steps4: topic identification, interpretation, and summary generation. During topic identification, the most important information in the document is recognized. Most systems allocate different preferences to different fragments of the text (words, phrases and sentences); the scores of every part are then combined to find the total score for a part. Finally, the system includes the N highest-scoring fragments in the concluding summary. Cue phrases, content counting and word occurrence are examples of the numerous techniques for topic identification that have been studied5. The interpretation step relates to abstract summaries: in this step, related issues are joined to form a common brief content and extra expressions are ignored6. Fusing topics is complicated; consequently, most systems produce extractive summaries. In the summary generation step, the system uses text generation procedures, ranging from very straightforward word or phrase printing to more complicated expression assimilation and sentence generation. In other words, this step generates natural language that is easy for users to understand.

Summarization systems are categorized by the type of summary they generate. Generally speaking, the methods can be either extractive or abstractive. Extractive summarization involves assigning salience scores to units of the document (e.g. sentences, paragraphs) and extracting the sentences with the highest scores, while abstractive summarization (e.g. http://www1.cs.columbia.edu/nlp/newsblaster/) usually needs information fusion, sentence compression and reformulation7,8.

Summarization techniques can also be classified into two groups: supervised techniques, which rely on pre-existing document-summary pairs, and unsupervised techniques, which are based on properties and heuristics derived from the text. Supervised extractive summarization techniques treat the summarization task as a two-class classification problem at the sentence level, where summary sentences are positive samples and non-summary sentences are negative samples. After representing each sentence by a vector of features, the classification function can be trained in different manners9, for example discriminatively with well-known algorithms such as the Support Vector Machine (SVM)10. Many unsupervised methods have been developed for document summarization by exploiting different features and relationships of the sentences; see for example11-17.
On the other hand, the summarization task can also be categorized as either generic or query-based. A query-based summary presents the information that is most relevant to the given queries8,18-20, while a generic summary gives an overall sense of the document's content; see for example8,11-14,18,20-23. There are also two different groups of text summarization: indicative and informative24. Indicative summarization gives the user the main idea of the text; the length of this type of summary is around 5 per cent of the given text. Informative summarization gives brief information on the main text; the length of an informative summary is around 20 per cent of the given text.

This work focuses on extractive summaries. Extractive summaries are produced by extracting key text segments (sentences or paragraphs) from the document, based on statistical analysis of individual or mixed surface-level features such as word/phrase frequency, position or cue words, to locate the sentences to be extracted. The "most important" content is treated as the "most frequent" or the "most favourably positioned" content; this approach therefore avoids any effort at deep text understanding, which makes it conceptually straightforward and simple to apply. To date, numerous procedures have been suggested to choose the most vital fragments of a text: statistical approaches, which include the Aggregation Similarity Method25, the Location Method26, the Frequency Method27 and the TF-Based Query Method28, and linguistic approaches, which include graph theory, lexical chains, WordNet and clustering.

According to29,30, a graph is defined as a group of nodes and a group of edges that join pairs of nodes. In the context of databases, the nodes symbolize individual components and the edges signify the relationships between these components. The triangle counting approach is a technique used to prune the graph, and the aim of this study is to use this technique to create a summary.

The rest of this paper is organized as follows: Section 2 discusses related work on graph-based text summarization and the triangle counting approach. An overview of the proposed approach is given in Section 3. Experimental results are given in Section 4, Section 5 discusses the work, and a summary is given in Section 6.

2. Related Work

2.1 Graph-Based Text Summarization

As our study focuses primarily on graph-based single-document extractive summarization, we examine work related to graph-based analysis. Several reports have stated that graph algorithms can generate document summaries very effectively; they provide a technique for determining the relevance of a vertex in a graph based on the data drawn from the document's graph structure31. When using graph models for natural language text documents, a graph is built which symbolises the document text and connects the words and the text having meaningful relationships31,32. Text units of varied sizes can be used as the graph vertices, e.g. words or sentences, and relationships, e.g. lexical and semantic relations32 or co-reference resolutions, can be used for drawing the edges between the graph vertices.
Graph-based ranking algorithms such as TextRank31, PageRank33 and HITS33, which can be applied to undirected or weighted graphs, are then iterated for the text-based ranking application33. Thereafter, the vertices are sorted by their final scores; the higher-ranked vertices form the document summary. The following types of graph models are primarily used for representing a text document.

2.1.1 Weighted Graph Model

In34, a sentence ranking and clustering-based text summarization method is suggested which extracts key sentences from the text. Initially, the process clusters all the sentences in the document using the SNMF (Sparse Non-negative Matrix Factorisation) technique. Then, a weighted, undirected graph is built which uses the sentence similarities, along with the discourse relationships between the sentences, as the weights of the edges. Thereafter, a graph-based ranking algorithm is applied for calculating the sentence scores, and the highest-ranked sentences of every cluster are chosen as the document summary. Equation 1 is used for calculating the rank of every vertex in the graph:

R(V_i) = (1 - d)/n + d \sum_{V_j \in adj(V_i)} w_{ji} R(V_j)    (1)

where V_i and V_j are vertices of the graph, d is a parameter between 0 and 1, n is the number of sentences, and w_{ji} is the normalised weight which is passed from V_j to V_i.

An algorithm was developed by35 for generating extractive summaries for Vietnamese documents. The algorithm constructs undirected graphs using the sentences of the texts as vertices, while the edges represent the sentence similarities. For two sentences S_i and S_j, the sentence similarity between them can be described by a term-frequency based cosine:

sim(S_i, S_j) = \frac{\sum_{w} tf_{w,i} \, tf_{w,j}}{\sqrt{\sum_{w} tf_{w,i}^2} \sqrt{\sum_{w} tf_{w,j}^2}}    (2)

where tf_{w,i} is the term frequency of word w in sentence S_i of the document. After applying the PageRank algorithm, the sentences are ranked by their salience score and are chosen according to their maximal marginal relevance36 for generating summaries35.

In the graph-based algorithm suggested by33 for the text summarization process, the sentences form the graph nodes, and sentences with a similarity between them are joined by an edge. After the graph is built, a summary is generated by finding the shortest path starting from the first sentence of the document and ending with the last sentence. This results in a smooth but short set of sentences between the two points.
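To make the weighted ranking of Equation 1 concrete, the following is a minimal Python sketch of the iteration; the damping value d = 0.85, the tolerance and the iteration cap are illustrative assumptions of ours, not values taken from the cited papers.

import numpy as np

def rank_sentences(W, d=0.85, tol=1e-6, max_iter=100):
    """Rank sentence vertices with the weighted PageRank-style
    recurrence of Equation 1. W[j, i] is the weight passed from
    vertex j to vertex i."""
    n = W.shape[0]
    # Normalise outgoing weights so each row of W sums to 1.
    row_sums = W.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0          # avoid division by zero
    W = W / row_sums
    r = np.full(n, 1.0 / n)                # initial uniform ranks
    for _ in range(max_iter):
        r_new = (1 - d) / n + d * (W.T @ r)
        if np.abs(r_new - r).sum() < tol:  # converged
            break
        r = r_new
    return r

The highest-ranked vertices returned by such an iteration then correspond to the candidate summary sentences.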
2.1.2 Heterogeneous Graph Model

Most existing methods do not consider the relations between the various granularities of a text (words, sentences and topics). A text document summarization technique based on heterogeneous graphs has been suggested by37. This method constructs a graph depicting the relationships between the words, sentences and topics, and uses a ranking algorithm for calculating the node scores; the sentences with the highest scores are selected as the summary. For estimating the similarity between two topics, a cosine measure is used:

sim(t_i, t_j) = \frac{\vec{t}_i \cdot \vec{t}_j}{|\vec{t}_i||\vec{t}_j|}    (3)

where \vec{t}_i and \vec{t}_j are the corresponding term vectors for the topics t_i and t_j. For a particular sentence and the word collection of a text, the link similarity refers to the mean similarity between the words present in the sentence and the word collection, and can be estimated as follows37:

sim(s_i, W) = \frac{1}{|s_i|} \sum_{w_j \in s_i} sim(w_j, W)    (4)

where w_j is a word present in the sentence s_i and sim(w_j, W) is its similarity with the word collection W. For a word collection and the topic collection of a text, the link similarity refers to the mean similarity between the words and the topics present in the text, and can be estimated as follows37:

sim(W, T) = \frac{1}{|W|} \sum_{w_i \in W} sim(w_i, T)    (5)

where w_i is a word present in the whole collection W and sim(w_i, T) is its similarity with the topic collection T. For a sentence and the topic collection of a text document, the link similarity refers to the cosine similarity between the sentence and a topic present in the text:

sim(s, t) = \frac{\vec{s} \cdot \vec{t}}{|\vec{s}||\vec{t}|}    (6)

where \vec{t} is the term vector of the topic t and \vec{s} is the term vector of the sentence s. The above method has been compared to a baseline method; because it uses the relations between the different granularities (word, sentence and topic), it demonstrates better results than the baseline.

2.1.3 Correlation Graph Model

In38, a graph-based general-purpose summarizer technique, GRAPHSUM, is suggested. This method discovers association rules for representing the correlations between multiple terms, which were neglected by earlier approaches. The graph nodes, which represent two or more terms, are first ranked by the PageRank strategy; thereafter, the highest-ranked nodes are used for sentence selection.

2.1.4 Semantic Graph Model

In39, a new graph-based model for document processing applications is suggested. The model depends on four dimensions (semantic similarity, cosine similarity, co-reference and discourse information) for graph creation. The results stated that co-reference resolution improved the performance in all the cases studied, whether with only syntactical similarity or after adding semantic similarity39. The results also proved that combining semantic similarity, cosine similarity, co-reference and discourse information produced the best results.

In40, the SentenceRank algorithm was developed, which uses semantic and statistical analysis of the sentences to estimate their importance for summarising a text. The algorithm builds semantic graphs in which the nodes are sentences and the edges represent the semantic relations between the sentences, estimated using WordNet. The nodes are ranked by the ranking algorithm, and the highest-ranked sentences are collected for the summary.

2.1.5 Query Specific Graph Model

In41, a query-based similarity measure was developed for existing graph models for estimating the sentence-to-sentence edge weights in multi-document summarization. The technique differentiates the intra-document and the inter-document sentence relations in the multi-document summarization process. The importance of a sentence with regards to a query is calculated as follows:

rel(s \mid q) = \frac{\vec{s} \cdot \vec{q}}{|\vec{s}||\vec{q}|}    (7)

sim(s_i, s_j) = \frac{\vec{s}_i \cdot \vec{s}_j}{|\vec{s}_i||\vec{s}_j|}    (8)

where s_i and s_j refer to two sentences of a document D, \vec{s} represents the sentence vector, and \vec{q} refers to the query vector. The sentences are then ranked using Equation 9:

Score(S_i) = (1 - d) + d \sum_{S_j \in N(S_i)} \frac{Score(S_j)}{|N(S_j)|}    (9)

where N(S_i) is the set of neighbouring sentence vertices of S_i, d is the PageRank damping factor, and Score(S_i) is the ranking of the sentence S_i.
2.2 Triangle Counting Approach

Numerous studies have been conducted to discover or implement new algorithms for counting triangles in graph data sets. In42, an algorithm known as node-iterator was introduced, which has an execution time of O(n \cdot d_{max}^2) \subset O(n^3). The listing-ayz algorithm43 is the listing version of the most proficient counting algorithm; it has a running time of O(m^{3/2}). It uses the idea of cores: it takes a node of minimum degree, calculates its triangles in the same manner as node-iterator, and then removes the node from the graph. The execution time is O(n \cdot c_{max}^2), where c(v) is the core number of node v. Since node-iterator-core is an enhancement over listing-ayz, the execution time of node-iterator-core is also O(m^{3/2}).

In44, a new, highly reliable, fast and parallelizable algorithm to count triangles was proposed. The parallelization is vital because it provides an opportunity to mine large graphs using parallel architectures such as map/reduce ('Hadoop')45. The proposed method rests on two theorems:

Theorem 1 (EigenTriangle): The total number of triangles in the graph is proportional to the sum of the cubes of the eigenvalues of its adjacency matrix:

\Delta(G) = \frac{1}{6} \sum_{i=1}^{n} \lambda_i^3    (10)

Theorem 2 (EigenTriangleLocal): The number of triangles \Delta_i in which node i participates can be counted from the cubes of the eigenvalues and the corresponding eigenvector entries:

\Delta_i = \frac{\sum_j \lambda_j^3 u_{i,j}^2}{2}    (11)

where u_{i,j} is the i-th entry of the j-th eigenvector.

In46, research was conducted on triangle counting algorithms, and a collection of methods based on random sampling was proposed for approximating, with high accuracy, the number of 3-node and 4-node minors of directed and undirected graphs. The algorithm was tested on various networks from 10 different fields. Based on the rate of occurrence of all the minors, an effective network clustering algorithm was also proposed.

In44, triangle counting algorithms were studied and a new approach was recommended: a method for the sparsification of a graph by changing it into an alternative weighted graph. The newly converted graph has a smaller number of edges, and its triangles are then counted with the EIGENTRIANGLE method. This method combines the concepts of EIGENTRIANGLE with the Achlioptas-McSherry algorithm.

In47, a new sampling algorithm for counting triangles was examined. The method was executed on large networks, with demonstrated speed-ups of up to 70,000 times in counting triangles. The performance of the algorithm is precise when the density of triangles is mild.

In48, a highly parallel implementation and a fresh randomized algorithm for estimating the number of triangles in an undirected graph were presented. The algorithm uses Monte-Carlo simulation to count the number of triangles. Each sample needs O(|E|) time, and

O(\varepsilon^{-2} \log(1/\delta) \, \rho(G)^2)    (12)

samples are needed to ensure an (\varepsilon, \delta) approximation, where \rho(G) is a measure of the triangle sparsity of G; \rho(G) is not necessarily small. The algorithm needs only O(|V|) space to work efficiently. The author provided experiments showing that in practice only O(\log^2 |V|) samples are usually needed for an accurate estimation. This algorithm is more efficient than other relevant state-of-the-art algorithms for counting triangles, especially in speed and accuracy. Unfortunately, this algorithm is parallel only when the critical path of O(|E|) is achievable on as few as O(\log^2 |V|) processors.
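As an illustration of Theorems 1 and 2 above, the following is a minimal NumPy sketch of our own (not code from the cited work) that counts triangles exactly from the full eigendecomposition; the practical EIGENTRIANGLE algorithm instead approximates the sums with only the top few eigenvalues.

import numpy as np

def triangle_counts(A):
    """Count triangles in an undirected graph from the eigen-
    decomposition of its adjacency matrix A (Theorems 1 and 2)."""
    # eigh handles the symmetric adjacency matrix of an undirected graph.
    eigvals, eigvecs = np.linalg.eigh(A)
    total = np.sum(eigvals ** 3) / 6.0             # Equation (10)
    # Equation (11): triangles containing each node i.
    per_node = (eigvecs ** 2) @ (eigvals ** 3) / 2.0
    return total, per_node

# Toy example: a 4-node graph with exactly two triangles, (0,1,2) and (0,2,3).
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
total, per_node = triangle_counts(A)
print(round(total))        # 2
print(np.round(per_node))  # nodes 0 and 2 lie in both triangles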
The EigenTriangle and EigenTriangleLocal algorithms were proposed in49 to calculate the total number of triangles, and the number of triangles that every node shares, in an undirected graph. These algorithms are effective for all types of graphs, except when the graphs have certain spectral properties. The authors confirmed this empirically by conducting 160 tests on diverse kinds of real networks. They observed important speed-ups, between 34x and 1,075x faster with 95% precision, compared to an uncomplicated counting algorithm.

In50, the authors proposed an approach that can adjust a graph by switching it to an alternative graph with a smaller number of edges and nodes. The major aim of that study was to utilize the triangle counting approach for graph-based association rule mining (ARM): a triangle counting technique is suggested to reduce the graph in the search for common item sets. The triangle counting is incorporated with one of the graph-based ARM methods and consists of four important steps: data representation, triangle production, bit vector representation, and triangle combination with the graph-based ARM technique. The performance of the suggested technique was compared with the main graph-based ARM; experimental results show that the suggested technique shortens the execution time of rule production and creates fewer rules with higher confidence.

3. Overview of Approach

In this section, we describe the several steps we undertook to generate the summary based on the graph-based method. The main steps are as follows:

1. Pre-processing
2. Graph construction
3. Centrality calculation
4. Sentence ranking
5. Summary generation
6. Evaluation and result

[Figure 1 shows the framework as a flowchart: document pre-processing (text segmentation, stop words removal, stemming), sentence similarity based on the hybrid model, text-graph representation, centrality calculation, sub-graph construction, features selection and scoring, sentence ranking, summary generation and evaluation.]

Figure 1. The Framework of our Proposed Graph-Based Summarization Approach.

3.1 Proposed Graph-based Approach for Single-Document Extractive Summarization

In this method, a graph-based representation is introduced as a new technique for extractive summary generation. This method has four main steps:

1. Data pre-processing
   a. Text segmentation
   b. Stop words removal
   c. Stemming
2. Text graph-based representation
3. Sub-graph construction
4. Summary generation

3.2 Data Pre-processing

The initial step in text summarisation is data pre-processing. In our study, this step comprises three sub-steps: text segmentation, stop word removal and word stemming. Text segmentation divides the text document into sentences. We use stop word removal to remove meaningless words, and a stemming algorithm to delete the affixes (prefixes or suffixes) of each word and produce the root word. In this step, we extract the important words present in the document and disregard the rest of the words, which could otherwise severely affect the similarity between documents.
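As a concrete illustration of this pre-processing pipeline, here is a minimal Python sketch using NLTK; the choice of NLTK and of the Porter stemmer is our assumption for illustration, since the paper does not name the tools it used.

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# One-time downloads of the tokenizer models and stop word list.
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

def preprocess(document):
    """Split a document into sentences, then reduce each sentence to its
    stemmed content words (segmentation, stop word removal, stemming)."""
    stop_words = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    sentences = nltk.sent_tokenize(document)           # text segmentation
    processed = []
    for sentence in sentences:
        tokens = [t.lower() for t in nltk.word_tokenize(sentence) if t.isalnum()]
        content = [t for t in tokens if t not in stop_words]   # stop word removal
        processed.append([stemmer.stem(t) for t in content])   # stemming
    return processed

print(preprocess("The cats are running. They ran quickly over the fences."))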
3.3 Feature Extraction

The textual document is symbolised by the set D = (S_1, S_2, ..., S_k), wherein S_i indicates a sentence found in document D. Feature extraction is then applied to the textual contents, while useful main sentence and word structures are determined. Each document features structures such as title words, sentence lengths, sentence positions, numerical data, term weights, sentence similarities, and thematic-word as well as proper-noun instances.

1. Title words: Higher scores are assigned to sentences which include words from the title, since the content's meaning is conveyed in the title words. This is determined in the following manner:

Score_{title}(S) = \frac{\text{number of title words in } S}{\text{number of words in the title}}    (13)

2. Sentence lengths: Sentences which are overly short, such as date or author lines, are removed. Each sentence's normalised length is evaluated as:

Score_{length}(S) = \frac{\text{number of words in } S}{\text{number of words in the longest sentence}}    (14)

3. Sentence positions: Higher scores are assigned to those sentences which occur earlier in their paragraphs. For every paragraph with n sentences, the score of the i-th sentence is evaluated as:

Score_{position}(S_i) = \frac{n - i + 1}{n}    (15)

4. Numerical data: Each sentence featuring numerical terms, which replicate major statistical figures in the text, is designated for summarisation. Each sentence's numerical score is evaluated as:

Score_{numerical}(S) = \frac{\text{number of numerical terms in } S}{\text{number of words in } S}    (16)

5. Thematic words: The number of thematic words (domain-specific terms exhibiting maximum-possible relativeness) found in a sentence, divided by the maximum number found in any sentence:

Score_{thematic}(S) = \frac{\text{number of thematic words in } S}{\max_{k} (\text{number of thematic words in } S_k)}    (17)

6. Sentence-to-sentence similarities: Token-matching techniques are utilised to calculate the similarity of every sentence S with all others. A matrix [N][N] is set up, where N is the total number of sentences found and the diagonal components are fixed at zero, since sentences are not to be evaluated against themselves. Each sentence's similarity score is evaluated as:

Score_{s2s}(S_i) = \frac{\sum_{j} sim(S_i, S_j)}{\max_{k} \sum_{j} sim(S_k, S_j)}    (18)
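To show how such feature scores can be computed in practice, here is a short Python sketch of three of them (title words, sentence position and numerical data); it is our own illustration of Equations 13, 15 and 16, operating on the stemmed token lists produced by the pre-processing step above.

import re

def title_word_score(sentence_tokens, title_tokens):
    # Equation (13): fraction of title words that appear in the sentence.
    title = set(title_tokens)
    if not title:
        return 0.0
    return len(title & set(sentence_tokens)) / len(title)

def position_scores(n_sentences):
    # Equation (15): earlier sentences in a paragraph score higher.
    n = n_sentences
    return [(n - i + 1) / n for i in range(1, n + 1)]

def numerical_score(sentence_tokens):
    # Equation (16): proportion of numerical terms in the sentence.
    if not sentence_tokens:
        return 0.0
    numerals = [t for t in sentence_tokens if re.fullmatch(r"\d+(\.\d+)?", t)]
    return len(numerals) / len(sentence_tokens)

print(title_word_score(["graph", "summar", "text"], ["text", "summar"]))  # 1.0
print(position_scores(4))                                 # [1.0, 0.75, 0.5, 0.25]
print(numerical_score(["sale", "rose", "12", "percent"])) # 0.25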
3.4 Text Graph-based Representation

Numerous techniques used for text representation are based on the Bag of Words model, also known as the Vector Space Model (VSM)51. In the graph representation we propose, the vertices refer to the sentences, whereas the edges are defined based on the similarity between the sentence pairs; a graph is then constructed depending on the similarity value between two sentences. Earlier work on extractive summarization using graph-based techniques used the cosine function for calculating the required similarity. Our function instead hybridises four distinct similarity measures: cosine similarity (sim1), Jaccard similarity (sim2), optimal local alignment-based similarity (sim3) and the sliding window-based similarity measure (sim4). We estimate the final similarity score in the following manner:

H(S_1, S_2) = \frac{sim_1 + sim_2 + sim_3 + sim_4}{4}    (19)

The four similarity measures are explained below.

3.4.1 Hybrid Similarity Function

This sub-section discusses the similarity measures used for the hybrid function.

• Cosine Function: In this method, a simple heuristic feature, like the overall word frequency in a document or in certain phrases which indicate sentence importance, is used52. The relevance of the words present in a sentence can be easily measured by estimating the TF value, which can be calculated in the following manner:

TF(w) = n_w    (20)

where n_w refers to the number of occurrences of the specific word w in the text document D. Inverse Document Frequency (IDF) can be estimated as follows:

IDF(w) = \log \frac{N}{n(w)}    (21)

where N is the total number of sentences present in the text and n(w) is the overall number of sentences having the word w. To explain further, words occurring very frequently in every sentence (like the articles "a", "the", etc.) have IDF values near 0, while rare words (like proper nouns or medical terms) have larger values. Thereafter, the TF-IDF value for every word in a sentence is estimated and the unique words present in every folder are extracted. Then, we build a vector for every sentence present in the text, where the vector size depends on the number of unique words present in the folder. Assuming that w_1, w_2, ..., w_m are the unique words present in a specific folder, the vectors for the sentences S_1 and S_2 are:

V_1 = (tfidf_{1,1}, ..., tfidf_{1,m}), \quad V_2 = (tfidf_{2,1}, ..., tfidf_{2,m})    (22)

The similarity between S_1 and S_2 can be estimated by applying the cosine formula as follows:

sim_1(S_1, S_2) = \frac{V_1 \cdot V_2}{|V_1||V_2|}    (23)

The sim_1 value always lies between 0 and 1. Hence, it can be concluded that sentences with a high sim_1 value are more similar than sentence pairs having a smaller value.

• Jaccard Similarity: For the two strings involved, the similarity ratio can be determined by dividing the number of intersecting (common) terms by the number of unique terms present in the two strings53,54. Mathematically, the similarity ratio is estimated as follows:

sim_2(S_1, S_2) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|}    (24)

Two boundary cases exist, in which the similarity score between the two strings equals 1 or 0:

sim_2(S_1, S_2) = 1 \text{ if } S_1 = S_2; \quad sim_2(S_1, S_2) = 0 \text{ if } S_1 \cap S_2 = \emptyset    (25)

• Optimal Local Alignment Algorithm: Optimal local alignment is a dynamic programming technique that determines the best local alignment present within two sequences55-57. In this study, we assume that the two sequences are the two sentences to be compared. The main aim of the local alignment similarity measure is to capture the phrase-level similarity present within the two sentences; the probable alignments are then compared using a scoring model. Let H(i, j) be the score of the optimal local alignment which ends at word a_i of S_1 and word b_j of S_2. Then, after proper initialisations (H(i, 0) = H(0, j) = 0), H(i, j) can be estimated in the following manner:

H(i, j) = \max \{ 0, \; H(i-1, j-1) + s(a_i, b_j), \; H(i-1, j) - g, \; H(i, j-1) - g \}    (26)

where s(a_i, b_j) is the match/mismatch score and g is the gap penalty. Based on Equation 26, we obtain the score matrix for the two sentences. Thereafter, we determine the highest score in the matrix and normalise the value by applying the following formula:

sim_3(S_1, S_2) = \frac{\max_{i,j} H(i, j)}{\min(m, n)}    (27)

where m and n are the lengths of the two sentences, respectively.
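The following is a minimal Python sketch of this local alignment recurrence (Equations 26 and 27) at the word level; the match score of 1, mismatch score of 0 and gap penalty of 1 are illustrative assumptions of ours, since the paper does not report its scoring parameters.

def local_alignment_sim(s1, s2, match=1.0, mismatch=0.0, gap=1.0):
    """Smith-Waterman style local alignment over word lists,
    normalised by the length of the shorter sentence."""
    m, n = len(s1), len(s2)
    if m == 0 or n == 0:
        return 0.0
    # (m+1) x (n+1) score matrix initialised to zero (Equation 26).
    H = [[0.0] * (n + 1) for _ in range(m + 1)]
    best = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if s1[i - 1] == s2[j - 1] else mismatch
            H[i][j] = max(0.0,
                          H[i - 1][j - 1] + s,  # align the two words
                          H[i - 1][j] - gap,    # gap in the second sentence
                          H[i][j - 1] - gap)    # gap in the first sentence
            best = max(best, H[i][j])
    return best / min(m, n)                     # Equation 27

print(local_alignment_sim("the cat sat on the mat".split(),
                          "a cat sat on a mat".split()))  # high: shared phrase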
• Windowing Algorithm: The similarity between two sentences present in the text can also be determined by windowing, which is a dynamic programming technique58-66. This method chooses a window of a specific size (neither very small nor very large). Assume two sentences, wherein S^s refers to the source sentence and S^t refers to the target sentence. Initially, a window of the defined size is set up for the two sentences. Thereafter, we determine whether each word in the source window matches any word in the target window; the total number of these matches is determined and stored in match. Thereafter, a score is computed for the considered window:

score = \frac{2 \cdot match}{w_s + w_t}    (28)

where w_s and w_t are the source and target window sizes. Then, the window on the target sentence S^t is moved to the right-hand side by one word, the score is estimated again, and the process continues. Thereafter, an average value of the estimated scores is determined, called AvgScore:

AvgScore_k = \frac{1}{n} \sum_{l=1}^{n} score_l    (29)

where n is the total number of window segments present in the target sentence. Then, the window on the source sentence S^s is moved to the right side by one word, the target window is reset to the initial position, and the scores are calculated again; the above procedure is repeated for calculating AvgScore_{k+1}. After this process, a final average score can be determined using all the AvgScore values:

sim_4(S^s, S^t) = \frac{1}{N} \sum_{k=1}^{N} AvgScore_k    (30)

where N refers to the number of window segments present in the source sentence. The final score estimated in this process is already normalized. A window size of five has been used in all our experiments.

• Combining the Four Similarity Measures: Once all the functions required for estimating the hybrid function have been explained, we can describe the hybrid function itself. The hybrid function takes the average of the four different values calculated by the functions described above; in this manner, all the functions are assigned equal weights. The measured value ranges between 0 and 1, as the values are normalized before calculating the mean. Thereafter, the similarity measure is applied for building a graph that denotes the similarity between the sentence pairs. Therefore:

H(S_i, S_j) = \frac{sim_1 + sim_2 + sim_3 + sim_4}{4}    (31)

3.5 Graph Representation

Based on the similarity measures, we depict the pictorial representation of the graph. All the hybrid similarity values are non-zero; it is also assumed that every sentence is similar to itself. Based on the similarity values, we build the graph for the text document. Following67, we set a scoring threshold (β) for sentence similarity at 0.5: if the similarity value is less than 0.5, there is no relation between the sentences, and hence no edge between the nodes.

3.6 Centrality Calculation

As can be observed in the similarity matrix, all the values are non-zero. This is explained by the fact that the sentences have been selected from the same text and tend to be similar to one another. Since we are primarily interested in the relevant similarities, we remove the low values by applying a threshold. Also, bear in mind that self-links are present for all graph nodes, as each sentence is trivially similar to itself; however, we have eliminated them for the sake of readability. The centrality of a sentence is obtained by counting the sentences similar to it; hence, sentence centrality is the degree of the corresponding node in the complete graph. As noted, the choice of the hybrid threshold greatly affects the centrality interpretation. As the centrality value is the number of edges incident to the node, it can range between one and the maximal number of sentences present in the text; therefore, this value must be normalized to lie between 0 and 1. This is done as follows:

c = \frac{deg(s) - min}{max - min}    (32)

where c is the sentence centrality, deg(s) is the degree of the sentence node in the graph, min is the minimal degree value present in the complete folder, and max is the maximal degree value present in the complete folder.
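Putting the pieces together, here is a short Python sketch of the hybrid function of Equation 31, the thresholded graph construction of Section 3.5 and the normalised centrality of Equation 32. The list of measure functions is a parameter; the demo passes only the Jaccard measure for brevity, with the cosine, alignment and window measures assumed to be defined as above.

from itertools import combinations

def jaccard_sim(s1, s2):
    # Equation (24): shared tokens over the union of tokens.
    a, b = set(s1), set(s2)
    return len(a & b) / len(a | b) if a | b else 0.0

def hybrid_similarity(s1, s2, measures):
    # Equation (31): equal-weight average of the normalised measures.
    return sum(m(s1, s2) for m in measures) / len(measures)

def build_graph(sentences, measures, beta=0.5):
    """Section 3.5: connect two sentence nodes only when their
    hybrid similarity reaches the threshold beta = 0.5."""
    edges = {}
    for i, j in combinations(range(len(sentences)), 2):
        h = hybrid_similarity(sentences[i], sentences[j], measures)
        if h >= beta:
            edges[(i, j)] = h
    return edges

def centrality(n_nodes, edges):
    # Section 3.6: degree of each node, min-max normalised (Equation 32).
    deg = [0] * n_nodes
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    lo, hi = min(deg), max(deg)
    return [1.0 if hi == lo else (d - lo) / (hi - lo) for d in deg]

sents = [["graph", "summar"], ["graph", "summar", "text"], ["weather", "rain"]]
edges = build_graph(sents, measures=[jaccard_sim])
print(edges)                 # {(0, 1): 0.666...}
print(centrality(3, edges))  # nodes 0 and 1 connected, node 2 isolated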
3.7 Sentence Scoring

For generating summaries in the extractive text summarization process, many important features are helpful in obtaining a summary. The summary quality depends on the selection of relevant sentences according to the feature scores, and feature efficacy is important to differentiate features of high and low relevance. In our study, we have strived to determine the influence of the features on the generated summary. In our work, we use combinations of the six features; a combination may contain two, three, four, five or six features, giving 63 possible combinations. All the sentences are assigned weights depending on the selected features, and the weights are normalised to range between 0 and 1. Finally, a sentence score is determined in the following manner:

SentenceScore = c \times F    (33)

where c is the normalised sentence centrality value and F is the selected-features value. The selected-features value is calculated using the following formula:

F = \frac{\sum_{i=1}^{n_f} w_i f_i}{n_f}    (34)

where n_f is the total number of selected features, f_i is the value of the i-th selected feature, and w_i is the weight provided to each measure. In this experiment, we have assumed w_1 = w_2 = ... = 1; however, the weights can range between 0 and 1 if the data changes, which affects the resultant values, which would then range between 0 and 1.

3.8 Sub-Graph Construction

The next step is to construct the triangle sub-graph. Triangles take as a baseline the fact that friends of friends tend to be friends. First, we create an adjacency matrix; the algorithm for creating it is given below, and a Python rendering of both algorithms follows Algorithm 2.

Algorithm 1. Adjacency Matrix Construction
Input: graph data set with N nodes (sentences) and E edges (relationships between sentences)
Output: N*N adjacency matrix showing the connections between nodes
Start
  Determine the size of the matrix, which is N*N (N is the number of nodes in the graph)
  Create the matrix A
  For each pair of nodes vi, vj ∈ N {
    If there is an edge from vi to vj {
      A(i,j) = 1    // if sim(Si, Sj) > 0
    } Else {
      A(i,j) = 0    // if there are no similar words between the sentences
    }
  }
Stop

Next, we build the list of triangles representing the text. To find the triangles in the graph, a De Morgan's laws algorithm is used, as follows.

Algorithm 2. De Morgan's Laws
Input: N*N adjacency matrix A(I,J)
Output: array of triangles
Start
  Triangles_Array = []
  For each edge in the matrix A(I,J), namely XY, find all edges starting with Y {
    XY ∧ YZ → XZ
    If XZ ∈ A(I,J) {
      Add the triangle of edges (X,Y,Z) to Triangles_Array[]
    }
  }
Stop
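The following is a minimal Python rendering of Algorithms 1 and 2 (our own sketch of the pseudocode above). The similarity threshold passed in as beta reflects the β = 0.5 edge criterion of Section 3.5; Algorithm 1 as written tests sim > 0.

def adjacency_matrix(sim, beta=0.5):
    """Algorithm 1: A[i][j] = 1 when the hybrid similarity of
    sentences i and j reaches the threshold beta."""
    n = len(sim)
    return [[1 if i != j and sim[i][j] >= beta else 0 for j in range(n)]
            for i in range(n)]

def find_triangles(A):
    """Algorithm 2: for every edge (x, y) and edge (y, z),
    (x, y, z) is a triangle when the closing edge (x, z) exists."""
    n = len(A)
    triangles = []
    for x in range(n):
        for y in range(x + 1, n):
            if not A[x][y]:
                continue
            for z in range(y + 1, n):
                if A[y][z] and A[x][z]:   # XY ∧ YZ ∧ XZ
                    triangles.append((x, y, z))
    return triangles

# Toy similarity matrix for four sentences; sentences 0-2 form a triangle.
sim = [[1.0, 0.8, 0.7, 0.1],
       [0.8, 1.0, 0.6, 0.2],
       [0.7, 0.6, 1.0, 0.3],
       [0.1, 0.2, 0.3, 1.0]]
A = adjacency_matrix(sim)
print(find_triangles(A))   # [(0, 1, 2)]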
3.9 Summary Generation

After the sentence scores are obtained, each sentence in the document is assigned its sentence score value. Only sentences within a triangle sub-graph structure are selected for consideration, as these are associated with no fewer than two others. Every sentence is then graded in descending order of its score, and those with high scores are extracted for the document summary according to the compression ratio. An extraction or compression rate close to 20% of the core textual content has been demonstrated to be as informative of the contents as the document's complete text68. In the final stage, the summarizing sentences are organized in the order of their occurrence in the initial text.

The summary generation process is based on the Maximal Marginal Relevance (MMR) concept69. The ranked_sen table contains all the sentences of the folder ordered by their rankings: the sentence with rank 1 is first, and the sentence with rank n is last. Initially, the sentence with rank 1 in the ranked_sen table is added to the summary. Thereafter, every sentence in the table is compared to the sentence that was last added to the summary table; the sentences are compared on their hybrid similarity values against the threshold value. If the two sentences have a similarity value higher than the threshold, the candidate sentence is eliminated, as the sentences already denote some similarity between them; if the value is lower than the threshold, the sentence is included in the summary. This enables the inclusion of all probable ideas described in the text, so that any person reading the summary would understand the concepts conveyed by the document.

4. Experimental Results

The proposed approach is evaluated on the single-document extractive summarization task, using 103 articles/data sets provided by the Document Understanding Conference 2002 (DUC 2002). For each data set, our approach generates a summary with a 20% compression rate. To assess the performance of our proposed approach, we compared its results with five benchmark summarizers: the Microsoft Word 2007 summarizer, the Copernic summarizer, the best automatic summarization system in DUC 2002, the worst automatic system in DUC 2002, and the average of the human model summaries (Models).

ROUGE measures generate recall, precision and F-measure scores. Recall is the proportion of words in the reference (human) summary that are also present in the system summary, whereas precision is the proportion of words in the system summary that are also present in the reference (human) summary. For comparative evaluation, Tables 1 and 2 show the mean coverage score (recall), average precision and average F-measure obtained on the DUC 2002 dataset for the proposed approach using ROUGE-1. All these values are reported at a 95% confidence interval.

Our study focused on finding the effect of the features and of the hybrid similarity measurement used to create the triangle sub-graph on the produced summary. From the obtained results, it is clear that the triangle sub-graph performs best when using combined features. For a single feature, the best value is obtained with the Sentence-to-Sentence (S2S) feature, with an F-measure of 0.49734 against the human-generated summaries. For combined features, the results show that combining all six features is better still, with an F-measure of 0.50102 against the human-generated summaries. It is also very clear from the results that combined features outperform any single feature.

Table 1. Comparison of single-document extractive summarization using ROUGE-1 at the 95% confidence interval for the hybrid model using a single feature

Method            Precision   Recall     F-Measure
H2:H1             0.51656     0.51642    0.51627
MS-Word           0.47705     0.40325    0.42888
Copernic          0.46144     0.41969    0.43611
Best-System       0.50244     0.40259    0.43642
Worst-System      0.06705     0.68331    0.1209
Triangles Hybrid  0.49642     0.49852    0.49734
Table 2. Comparison of single-document extractive summarization using ROUGE-1 at the 95% confidence interval for the hybrid model using combined features

Method            Precision   Recall     F-Measure
H2:H1             0.51656     0.51642    0.51627
MS-Word           0.47705     0.40325    0.42888
Copernic          0.46144     0.41969    0.43611
Best-System       0.50244     0.40259    0.43642
Worst-System      0.06705     0.68331    0.1209
Triangles Hybrid  0.49983     0.50249    0.50102

5. Discussion

The experimental results of the proposed method, based on the hybrid graph model with specific selected features, show that the resulting summaries can be better than other summaries. DUC 2002 was used as the data source, with a collection of news articles as input to our experiments. Three evaluation metrics (mean coverage score (recall), average precision and average F-measure) are employed for the comparative evaluation of the proposed approach and the other summarization systems.

In this approach, we used six different features for each sentence and a hybrid function consisting of four different similarity measurements to find the relations between the sentences (graph nodes) and represent the graph; we then pruned the graph by finding the triangle sub-graph, and used the sentences forming this sub-graph to find the summary. The scoring of the sentences was done based on the values of the selected features (either a single feature or combined features, with 63 possible feature combinations) combined with the centrality value of each sentence in the sub-graphs.

It can be observed from the results given in Tables 1 and 2 that, on mean coverage score and average F-measure, using the triangle sub-graph yields better summarization results in both cases, single and multiple features. Based on the experimental results of the proposed method, we can say that identifying significant features for text summarization, and using a hybrid similarity measurement, can produce a good summary. In addition, the triangle sub-graph is the best representation for pruning the graph to create a summary close to the ideal human summary.
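For concreteness, the following is a minimal Python sketch of the ROUGE-1 scores discussed above (unigram recall, precision and F-measure); it is a simplified stand-in of ours for the official ROUGE toolkit used in the evaluation.

from collections import Counter

def rouge_1(system, reference):
    """Unigram overlap between a system summary and a reference
    (human) summary: recall, precision and F-measure."""
    sys_counts = Counter(system.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each unigram counts at most as often as it
    # appears in the other summary.
    overlap = sum((sys_counts & ref_counts).values())
    recall = overlap / sum(ref_counts.values())
    precision = overlap / sum(sys_counts.values())
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return recall, precision, f1

print(rouge_1("the cat sat on the mat", "the cat lay on the mat"))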
6. Summary

This paper has presented the use of a hybrid similarity model for creating graphs to be used to find summaries of single documents. The model was trained and tested using a collection of one hundred and three documents gathered from the DUC 2002 dataset. We used a hybrid function consisting of four different similarity measurements to find the relations between sentences and create the graph. The method takes several features into account: title words, sentence lengths, sentence positions, numerical data, thematic words and sentence-to-sentence similarities. The effect of each of these sentence features on the summarization task was explored, and the features were then used in combination to create text summaries. The results of the proposed summarizer were compared with those of different summarizers, such as the Microsoft Word 2007 summarizer, the Copernic summarizer, the best system and the worst system. The ROUGE toolkit was used to evaluate the system summaries at 95% confidence intervals, and the results were extracted using average recall, precision and F-measure. The F-measure was chosen as a selection criterion because it balances both recall and precision for the system's results. The results show that the best average precision, recall and F-measure are produced by our proposed method. The experimental results based on the proposed model show that the triangle sub-graph is the best representation for creating a better summary, and that the hybrid model used to create the graph improves the quality of the summary.

7. Acknowledgement

This research is funded by the Faculty of Computing (FC), Universiti Teknologi Malaysia (UTM) under VOT no. R.J130000.7828.4F719. The authors would like to thank the Research Management Centre (RMC) of UTM for their support and cooperation.

8. References

1. David FS. Model driven architecture: applying MDA to enterprise computing. USA: Wiley Publishing, Inc.; 2003.
2. Mani I. Automatic summarization. John Benjamins Publishing; 2001.
3. Radev DR, Hovy E, McKeown K. Introduction to the special issue on summarization. Computational Linguistics. 2002; 28(4):399-408.
4. Lin CY, Hovy E. Identifying topics by position. Proceedings of the Fifth Conference on Applied Natural Language Processing. 1997; p. 283-90.
5. Mazdak N. FarsiSum: a Persian text summarizer. Master's thesis, Department of Linguistics, Stockholm University. 2004.
6. Langville AN, Meyer CD. Google's PageRank and beyond: the science of search engine rankings. Princeton University Press; 2011.
7. Mani I, Maybury MT. Advances in automatic text summarization. MIT Press; 1999.
8. Wan X. Using only cross-document relationships for both generic and topic-focused multi-document summarizations. Information Retrieval. 2008; 11(1):25-49.
9. Mihalcea R, Ceylan H. Explorations in automatic book summarization. Proceedings of EMNLP-CoNLL. 2007; p. 380-89.
10. Yeh JY, Ke HR, Yang WP, Meng IH. Text summarization using a trainable summarizer and latent semantic analysis. Information Processing & Management. 2005; 41(1):75-95.
11. Alguliev R, Bagirov A. Global optimization in the summarization of text documents. Automatic Control and Computer Sciences. 2005; 39(6):42-47.
12. Alguliev RM, Aliguliyev RM. Effective summarization method of text documents. The 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI'05). 2005; p. 264-71.
13. Aliguliyev RM. A novel partitioning-based clustering method and generic document summarization. Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology. 2006; p. 626-29.
14. Aliguliyev RM. Automatic document summarization by sentence extraction. Вычислительные технологии (Computational Technologies). 2007; 12(5).
15. Erkan G, Radev DR. LexPageRank: prestige in multi-document text summarization. Proceedings of EMNLP. 2004; p. 365-71.
16. Erkan G, Radev DR. LexRank: graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research. 2004; 22:457-79.
17. Radev DR, Jing H, Stys M, Tam D. Centroid-based summarization of multiple documents. Information Processing & Management. 2004; 40(6):919-38.
18. Dunlavy DM, O'Leary DP, Conroy JM, Schlesinger JD. QCS: a system for querying, clustering and summarizing documents. Information Processing & Management. 2007; 43(6):1588-605.
19. Fisher S, Roark B. Query-focused summarization by supervised sentence ranking and skewed word distributions. Proceedings of the Document Understanding Conference (DUC 2006), New York, USA. 2006.
20. Li J, Sun L, Kit C, Webster J. A query-focused multi-document summarizer based on lexical chains. Proceedings of the Document Understanding Conference. 2007.
21. Gong Y, Liu X. Generic text summarization using relevance measure and latent semantic analysis. Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 2001; p. 19-25.
22. Jones KS. Automatic summarising: the state of the art. Information Processing & Management. 2007; 43(6):1449-81.
23. Salton G, Buckley C. Improving retrieval performance by relevance feedback. Readings in Information Retrieval. 1997; 24(5):355-63.
24. Gupta V, Lehal GS. A survey of text summarization extractive techniques. Journal of Emerging Technologies in Web Intelligence. 2010; 2(3):258-68.
25. Crovella ME, Bestavros A. Self-similarity in World Wide Web traffic: evidence and possible causes. IEEE/ACM Transactions on Networking. 1997; 5(6):835-46.
26. Kupiec JM, Schuetze H. System for genre-specific summarization of documents. Google Patents. 2004.
27. Mihalcea R. Graph-based ranking algorithms for sentence extraction, applied to text summarization. Proceedings of the ACL 2004 on Interactive Poster and Demonstration Sessions. 2004; 20.
28. Patil K, Brazdil P. Text summarization: using centrality in the pathfinder network. International Journal of Computer Science and Information Systems. 2007; 2:18-32.
29. Cover TM, Thomas JA. Elements of information theory. John Wiley & Sons; 2012.
30. McKee T, McMorris F. Topics in intersection graph theory (SIAM Monographs on Discrete Mathematics and Applications). Society for Industrial and Applied Mathematics; 1999.
31. Mihalcea R, Tarau P. TextRank: bringing order into texts. 2004.
32. Sonawane S, Kulkarni P. Graph based representation and analysis of text document: a survey of techniques. International Journal of Computer Applications. 2014; 96(19):1.
33. Thakkar KS, Dharaskar RV, Chandak M. Graph-based algorithms for text summarization. 2010 3rd International Conference on Emerging Trends in Engineering and Technology (ICETET). 2010; p. 516-19.
34. Ge SS, Zhang Z, He H. Weighted graph model based sentence clustering and ranking for document summarization. 2011 4th International Conference on the Interaction Sciences (ICIS). 2011; p. 90-95.
35. Hoang TAN, Nguyen HK, Tran QV. An efficient Vietnamese text summarization approach based on graph model. 2010 IEEE RIVF International Conference on Computing and Communication Technologies, Research, Innovation, and Vision for the Future (RIVF). 2010; p. 1-6.
36. Nomoto T, Matsumoto Y. A new approach to unsupervised text summarization. Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 2001; p. 26-34.
37. Wei Y. Document summarization method based on heterogeneous graph. 2012 9th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD). 2012; p. 1285-89.
38. Baralis E, Cagliero L, Mahoto N, Fiori A. GRAPHSUM: discovering correlations among multiple terms for graph-based summarization. Information Sciences. 2013; 249:96-109.
39. Ferreira R, Freitas F, Cabral SL, Lins RD, Lima R, França G. A four dimension graph model for automatic text summarization. 2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT). 2013; p. 389-96.
40. Ramesh A, Srinivasa K, Pramod N. SentenceRank: a graph based approach to summarize text. 2014 Fifth International Conference on Applications of Digital Information and Web Technologies (ICADIWT). 2014; p. 177-82.
41. Wei F, He Y, Li W, Lu Q. A query-sensitive graph-based sentence ranking algorithm for query-oriented multi-document summarization. 2008 International Symposiums on Information Processing (ISIP). 2008; p. 9-13.
42. Schank T, Wagner D. Finding, counting and listing all triangles in large graphs, an experimental study. International Workshop on Experimental and Efficient Algorithms. 2005; p. 606-09.
43. Alon N, Matias Y, Szegedy M. The space complexity of approximating the frequency moments. Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing. 1996; p. 20-29.
44. Tsourakakis CE. Fast counting of triangles in large real networks without counting: algorithms and laws. 2008 Eighth IEEE International Conference on Data Mining. 2008; p. 608-17.
45. Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters. Communications of the ACM. 2008; 51(1):107-13.
46. Bordino I, Donato D, Gionis A, Leonardi S. Mining large networks with subgraph counting. 2008 Eighth IEEE International Conference on Data Mining. 2008; p. 737-42.
47. Tsourakakis CE, Kang U, Miller GL, Faloutsos C. DOULION: counting triangles in massive graphs with a coin. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2009; p. 837-46.
48. Avron H. Counting triangles in large graphs using randomized matrix trace estimation. Workshop on Large-scale Data Mining: Theory and Applications. 2010; p. 1-10.
49. Tsourakakis CE. MACH: fast randomized tensor decompositions. Proceedings of SDM. 2010; p. 689-700.
50. Al-Khassawneh YAJ, Bakar AA, Zainudin S. Triangle counting approach for graph-based association rules mining. 2012 9th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD). 2012; p. 661-65.
51. Schenker A, Kandel A, Bunke H, Last M. Graph-theoretic techniques for web content mining. World Scientific. 2005; 62.
52. Qian G, Sural S, Gu Y, Pramanik S. Similarity between Euclidean and cosine angle distance for nearest neighbor queries. Proceedings of the 2004 ACM Symposium on Applied Computing. 2004; p. 1232-37.
53. Hajeer I. Comparison on the effectiveness of different statistical similarity measures. International Journal of Computer Applications. 2012; 53(8):1.
54. Manusnanth P, Arj-in S. Document clustering results on the semantic web search. Proceedings of the 5th National Conference on Computing and Information Technology. 2009.
55. Attwood TK, Parry-Smith DJ. Introduction to bioinformatics. Prentice Hall; 2003.
56. Higgs PG, Attwood TK. Bioinformatics and molecular evolution. John Wiley & Sons; 2013.
57. Smith TF, Waterman MS. Identification of common molecular subsequences. Journal of Molecular Biology. 1981; 147(1):195-97.
58. Chavez A, Davila H, Gutierrez Y, Fernandez-Orquín A, Montoyo A, Muñoz R. UMCC_DLSI_SemSim: multilingual system for measuring semantic textual similarity. Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014). 2014; p. 716-21.
59. Cremonesi P, Koren Y, Turrin R. Performance of recommender algorithms on top-n recommendation tasks. Proceedings of the Fourth ACM Conference on Recommender Systems. 2010; p. 39-46.
60. Datar M, Gionis A, Indyk P, Motwani R. Maintaining stream statistics over sliding windows. SIAM Journal on Computing. 2002; 31(6):1794-813.
61. Grefenstette G. Evaluation techniques for automatic semantic extraction: comparing syntactic and window based approaches. Proceedings of the SIGLEX Workshop on Acquisition of Lexical Knowledge from Text, Columbus, Ohio. 1993.
62. Koren Y. Factorization meets the neighborhood: a multifaceted collaborative filtering model. Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2008; p. 426-34.
63. Quinlan JR. Discovering rules by induction from large collections of examples. Expert Systems in the Micro Electronic Age. Edinburgh University Press; 1979.
64. Quinlan JR. Learning efficient classification procedures and their application to chess end games. Machine Learning: An Artificial Intelligence Approach, 1. 1983.
65. Wirth J, Catlett J. Experiments on the costs and benefits of windowing in ID3. Proceedings of ML. 1988; p. 87-99.
66. Wu TJ, Huang YH, Li LA. Optimal word sizes for dissimilarity measures and estimation of the degree of dissimilarity between DNA sequences. Bioinformatics. 2005; 21(22):4125-32.
67. Mihalcea R, Corley C, Strapparava C. Corpus-based and knowledge-based measures of text semantic similarity. Proceedings of AAAI, Boston, Massachusetts. 2006 July 16-20; p. 775-80.
68. Morris AH, Kasper GM, Adams DA. The effects and limitations of automated text condensing on reading comprehension performance. Information Systems Research. 1992; 3(1):17-35.
69. Carbonell J, Goldstein J. The use of MMR, diversity-based reranking for reordering documents and producing summaries. Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 1998; p. 335-36.