Recent Ideas in Summarization
Arzucan Ozgur
SI/EECS 767
March 12, 2010

Introduction
The goal of summarization is to take an information source, extract content from it, and present the most important content in a condensed form.

Types of summarization:
- Extractive vs. abstractive
- Single-document vs. multi-document
- Generic vs. query-oriented
This talk focuses on extractive multi-document summarization.

Multi-document Summarization
Given a document collection D = {D1, ..., Dn} describing the same or a closely related set of events, produce a summary S.

Extractive Multi-document Summarization
Given a document collection D = {D1, ..., Dn} describing the same or a closely related set of events, produce a summary S consisting of sentences drawn from D and totaling at most L words.

Approaches
- Statistical models, e.g. LDA (Arora & Ravindran, 2008; Haghighi & Vanderwende, 2009)
- Graph-based models (Wan, 2008; Wan & Yang, 2008)
- Utilizing linguistic and semantic information (Hachey, 2009)
- Utilizing external resources, Wikipedia and WordNet (Nastase, 2008)

Exploring Content Models for Multi-document Summarization (Haghighi & Vanderwende, 2009)

Introduction
- Explores generative probabilistic models for multi-document summarization.
- Begins with a simple word-frequency-based model, then constructs a sequence of models, each injecting more structure into the representation of the document set content.

SumBasic [Nenkova & Vanderwende, 2006]
A simple unigram distribution P_D, which initially reflects the observed unigram probabilities in the document collection, scores each sentence by the average probability of its words:

Score(S) = (1 / |S|) * Sum_{w in S} P_D(w)

Example: for "The Phantom Menace was a major financial success." with content-word probabilities phantom: 0.15, menace: 0.12, major: 0.01, financial: 0.05, success: 0.04, Score(S) = 0.37 / 5 = 0.074.

The algorithm repeatedly selects the highest-scoring sentence, then down-weights the probabilities of the words in the selected sentence to discourage redundancy in later selections.
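A minimal sketch of this loop, assuming whitespace tokenization and the squared-probability redundancy update p(w) <- p(w)^2 from Nenkova & Vanderwende; stop-word filtering and real tokenization are omitted for brevity, and the function name is chosen here, not taken from the talk.

```python
from collections import Counter

def sumbasic(sentences, max_words):
    """Greedy SumBasic loop: repeatedly pick the sentence with the highest
    average word probability, then square the probabilities of its words."""
    tokens = [w for s in sentences for w in s.lower().split()]
    prob = {w: c / len(tokens) for w, c in Counter(tokens).items()}

    summary, length = [], 0
    pool = list(sentences)
    while pool and length < max_words:
        # Score = mean unigram probability of the sentence's words.
        best = max(pool, key=lambda s: sum(prob[w] for w in s.lower().split())
                                       / max(len(s.split()), 1))
        summary.append(best)
        length += len(best.split())
        pool.remove(best)
        # Down-weight covered words so later picks favor new content.
        for w in set(best.lower().split()):
            prob[w] **= 2
    return summary
```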
Limitations of SumBasic:
- It relies on the raw empirical unigram distribution to represent content significance.
- It makes no distinction between words that occur many times in a single document and words that occur the same number of times spread across several documents.

KLSum
KLSum finds the set of summary sentences whose word distribution most closely matches the document set unigram distribution.

KL(P || Q) is the Kullback-Leibler divergence between the true distribution P (here, the document set unigram distribution) and the approximating distribution Q (here, P_S, the empirical unigram distribution of the candidate summary). KLSum selects the summary S, of at most L words, that minimizes KL(P_D || P_S).

Example: if the collection distribution is phantom: 0.15, menace: 0.12, financial: 0.05, then a candidate summary with distribution phantom: 0.16, menace: 0.11, financial: 0.05 is closer to the collection distribution than a candidate with phantom: 0.20, menace: 0.09, financial: 0.01, so the first candidate is preferred.
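A minimal greedy sketch of this objective. Exact minimization over sentence subsets is combinatorial, so the sketch adds whichever sentence most reduces KL(P_D || P_S) until the length budget is exhausted; the smoothing constant eps and the function names are illustrative choices, not from the paper.

```python
import math
from collections import Counter

def kl_divergence(p, q, eps=1e-6):
    """KL(P || Q); Q is smoothed by eps so the log stays finite."""
    return sum(pw * math.log(pw / (q.get(w, 0.0) + eps))
               for w, pw in p.items() if pw > 0)

def klsum(sentences, max_words):
    tokens = [w for s in sentences for w in s.lower().split()]
    p_doc = {w: c / len(tokens) for w, c in Counter(tokens).items()}

    summary, summary_tokens = [], []
    pool = list(sentences)
    while True:
        # Consider only sentences that still fit the length budget.
        fits = [s for s in pool
                if len(summary_tokens) + len(s.split()) <= max_words]
        if not fits:
            break
        # Greedily add the sentence minimizing KL(P_doc || P_summary).
        def kl_after(s):
            toks = summary_tokens + s.lower().split()
            q = {w: c / len(toks) for w, c in Counter(toks).items()}
            return kl_divergence(p_doc, q)
        best = min(fits, key=kl_after)
        summary.append(best)
        summary_tokens += best.lower().split()
        pool.remove(best)
    return summary
```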
TopicSum
The raw unigram distribution may not best reflect the content of the document collection for summary extraction. TopicSum instead learns a content distribution with an LDA (Latent Dirichlet Allocation) style topic model.

Topic Models
- A generative probabilistic model for text corpora.
- Documents are mixtures of topics; a topic is a probability distribution over words.
- The model defines a procedure by which documents can be generated.

Generative Models: Latent Dirichlet Allocation [Blei et al., 2003]
The Dirichlet prior's probability density function returns the belief that the probabilities of K rival events are theta_i, given that each event has been observed alpha_i - 1 times:

p(theta_1, ..., theta_K; alpha_1, ..., alpha_K) = [Gamma(Sum_i alpha_i) / Prod_i Gamma(alpha_i)] * Prod_i theta_i^(alpha_i - 1)

Graphical Model Representation of LDA
- alpha: the parameter of the uniform Dirichlet prior on the per-document topic distributions.
- beta: the parameter of the uniform Dirichlet prior on the per-topic word distribution.
- theta_i: the topic distribution for document i.
- z_ij: the topic for the jth word in document i.
- w_ij: the specific word (the only observable variables).

Inference
Exact inference is computationally intractable, so approximation methods (e.g. variational inference or Gibbs sampling) are used.

TopicSum Topic Model
Three kinds of distributions:
- Background distribution (e.g. the: 0.20, an: 0.15, ..., financial: 0.03, phantom: 0.01): shared across document collections; models stop words that do not contribute content.
- Content distribution for collection D (e.g. star: 0.20, wars: 0.15, phantom: 0.12, financial: 0.09): represents the significant content of the document collection D.
- Document-specific distributions (e.g. award: 0.20, nomination: 0.15, won: 0.12, director: 0.09 for one document; financial: 0.12, million: 0.09, $: 0.05 for another): represent words that are local to a single document in the collection but do not appear across several documents.

Generating a sentence:
- Draw the sentence's distribution over topics from a Dirichlet prior with pseudo-counts Content (1), DocSpecific (5), Background (10).
- For each word position, draw a topic Z, then draw a word W from the topic that Z indicates.

TopicSum Summarization
Summary sentences are extracted with KLSum by plugging the learned content distribution (e.g. phantom: 0.15, menace: 0.12, financial: 0.05) in place of the raw unigram distribution.

HierSum Topic Model
Structured content models capture:
- The general content of the document collection.
- Several sub-stories, each with its own specific vocabulary.
For each word, the model asks: is this a general or a specific content word, and if specific, which specific topic does it belong to?

Results / Example Summarization Output / Manual User Evaluation
(Presented as figures and tables, not reproduced here.)

Latent Dirichlet Allocation and Singular Value Decomposition Based Multi-document Summarization (Arora & Ravindran, 2008)

Introduction
- Similar to HierSum's structured content idea.
- A set D of related documents shares a common theme (central topic).
- The documents in D also contain sub-topics, which may be common to some of the documents and which support or give detail about the central theme.
Approach:
- Use LDA to find the different topics in the documents: documents are represented as mixtures of topics, and topics as mixtures of words.
- Use SVD (Singular Value Decomposition) to find the sentences that best represent these topics.

LDA
- Each document Dk is a mixture model over the topics Tj: P(Tj | Dk).
- Each topic Tj is a mixture model over the words Wi of the vocabulary: P(Wi | Tj).
- Topics are independent of each other.
- Assuming each sentence represents one topic, K different topics yield K independent representations of each sentence.

Singular Value Decomposition
- SVD is applied to the term-by-sentence matrix, weighted by the LDA mixture model, to reduce redundancy during summary creation.
- The entry with the highest value in a right singular vector (a column of V) corresponds to the original sentence vector with the greatest impact in that orthogonal direction.
- The method finds the sentences with the greatest impact in different orthogonal directions.
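A minimal numpy sketch of the selection step. It assumes a term-by-sentence matrix that has already been weighted by the LDA mixture model (a random matrix stands in below); taking the largest-magnitude entry of each right singular vector is one reading of the slide's description rather than the paper's exact procedure.

```python
import numpy as np

def pick_by_svd(term_sentence, n_pick):
    """term_sentence: (n_terms, n_sentences) weighted matrix.
    Returns indices of the sentences that dominate the strongest
    orthogonal directions of the matrix."""
    # Each row of vt is a right singular vector with one entry per sentence.
    _, _, vt = np.linalg.svd(term_sentence, full_matrices=False)
    chosen = []
    for direction in vt:                 # strongest directions come first
        idx = int(np.argmax(np.abs(direction)))
        if idx not in chosen:            # avoid picking a sentence twice
            chosen.append(idx)
        if len(chosen) == n_pick:
            break
    return chosen

# Toy usage: 50 terms, 10 sentences, stand-in for the LDA-weighted matrix.
rng = np.random.default_rng(0)
print(pick_by_svd(rng.random((50, 10)), n_pick=3))
```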
Results
(Presented as tables, not reproduced here.)

Multi-document Summarization Using Cluster-based Link Analysis (Wan & Yang, 2008)

Graph-based Multi-document Summarization
- Build a sentence graph whose edge weights are the cosine similarities between sentences.
- A Markov random walk model over this graph identifies the salient sentences, as in LexRank [Erkan & Radev, 2004]. (Discussed in the Random Walks talk by Ahmed Hassan.)
- Transition probability: p(vi -> vj) = w_ij / Sum_k w_ik, the edge weight normalized over vi's outgoing edges.
- Sentence saliency: the stationary probability of the damped walk, Score(vi) = mu * Sum_j Score(vj) * p(vj -> vi) + (1 - mu) / n, as in PageRank.
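A minimal sketch of this saliency computation by power iteration over a cosine-similarity graph; the damping value 0.85, the iteration count, and the function names are illustrative choices, not taken from the paper.

```python
import numpy as np

def cosine_matrix(vectors):
    """Pairwise cosine similarities between sentence vectors (rows)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.where(norms == 0, 1, norms)
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)          # no self-loops
    return sim

def random_walk_saliency(sim, damping=0.85, iters=100):
    """Stationary distribution of the damped walk over the sentence graph."""
    n = sim.shape[0]
    row_sums = sim.sum(axis=1, keepdims=True)
    trans = sim / np.where(row_sums == 0, 1, row_sums)   # row-stochastic
    score = np.full(n, 1.0 / n)
    for _ in range(iters):
        score = (1 - damping) / n + damping * (trans.T @ score)
    return score

# Toy usage: 5 sentences in a 20-dimensional bag-of-words space.
rng = np.random.default_rng(0)
print(random_walk_saliency(cosine_matrix(rng.random((5, 20)))))
```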
Cluster-based Link Analysis
- The basic graph-based model uses only sentence-level information.
- A document set consists of a number of themes (sub-topics), each represented by a cluster of related sentences.
- Not all theme clusters are equally important, and not all sentences within a theme cluster are equally important.
Goal: incorporate cluster-level information and sentence-to-cluster relationships into the graph-based model.

System Overview
1. Theme cluster detection (K-means, agglomerative, or divisive clustering).
2. Sentence score computation.
3. Summary extraction.

Cluster-based Conditional Markov Random Walk Model
The edge weights, and hence the transition probabilities, are conditioned on the clusters of the linked sentences, using:
- The importance of the cluster: the cosine similarity between the cluster and the whole document set.
- The correlation between a sentence vi and its cluster: the cosine similarity between the sentence and the cluster.

Cluster-based HITS Model
- Theme clusters act as hubs and sentences as authorities.
- w_ij: the cosine similarity between the sentence and the cluster.

Results
(Tables and the plots for the influence of the combination weight lambda and of the number of clusters are not reproduced here.)
- The conditional Markov random walk based model is more robust than the HITS-based model.
- CMRW uses both sentence-to-sentence and sentence-to-cluster relationships, whereas HITS uses only sentence-to-cluster relationships and is therefore strongly influenced by the theme clusters that are detected.

An Exploration of Document Impact on Graph-based Multi-document Summarization (Wan, 2008)

Document-based Graph Model
The edge weights are conditioned on:
- The importance of the document in the document set.
- The correlation between a sentence vi and its document.

Document Importance (three alternatives)
- The cosine similarity between the document and the whole document set.
- The average similarity between the document and every other document in the document set.
- Constructing a weighted similarity graph between documents and using PageRank to compute the documents' rank scores.

Sentence-Document Correlation
- Based on the position of the sentence within the document, or
- Based on the cosine similarity between the sentence and the document.

Results
- Document-level analysis performs better than cluster-level analysis.
- (The plot for the influence of the combination weight lambda is not reproduced here.)

Multi-document Summarisation Using Generic Relation Extraction (Hachey, 2009)

Introduction
- Goal: investigate the effect of various sentence representation models on the accuracy of multi-document summarization.
- Motivation: capture deeper semantic information with sentence representations based on Information Extraction.
- Proposes a sentence representation based on Generic Relation Extraction (GRE).

Summarization Framework / Sentence Representation Models
(Presented as figures, not reproduced here.)

Results
- System combination can improve results.

Topic-driven Multi-document Summarization with Encyclopedic Knowledge and Spreading Activation (Nastase, 2008)

Introduction
The goal of topic-driven summarization is to derive from a set of documents a summary that contains information on a specific topic of interest to the user. This requires understanding the user's information request and understanding the documents to be summarized, and understanding in turn requires lexical, common-sense, and encyclopedic knowledge.
Approach:
- Topic expansion using Wikipedia and WordNet.
- Topic expansion with spreading activation and PageRank.

Sample Topics from DUC 2007 / Extracting Related Concepts from Wikipedia / Topic Expansion / Results / Impact of Signal Decay in Spreading Activation
(Presented as figures and tables, not reproduced here.)

Summary and Discussion
- Extractive multi-document summarization.
- Different approaches: statistical, graph-based, linguistically motivated, and approaches using external resources (WordNet and Wikipedia).
- Performance is not directly comparable across the systems.
- Statistical methods vs. deeper understanding of text?

References
Rachit Arora and Balaraman Ravindran. Latent Dirichlet allocation and singular value decomposition based multi-document summarization. In ICDM'08: Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, pages 713-718, Washington, DC, USA, 2008.
Aria Haghighi and Lucy Vanderwende. Exploring content models for multi-document summarization. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 362-370, Boulder, Colorado, June 2009. Association for Computational Linguistics.
Xiaojun Wan and Jianwu Yang. Multi-document summarization using cluster-based link analysis. In SIGIR'08: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 299-306, New York, NY, USA, 2008.
Xiaojun Wan. An exploration of document impact on graph-based multi-document summarization. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 755-762, Honolulu, Hawaii, October 2008. Association for Computational Linguistics.
Ben Hachey. Multi-document summarisation using generic relation extraction. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 420-429, Singapore, 6-7 August 2009. Association for Computational Linguistics.
Vivi Nastase. Topic-driven multi-document summarization with encyclopedic knowledge and spreading activation. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 763-772, Honolulu, Hawaii, October 2008. Association for Computational Linguistics.
Mark Steyvers and Tom Griffiths. Probabilistic Topic Models. Lawrence Erlbaum Associates, 2007.
David Blei, Andrew Ng, and Michael Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, January 2003.

Thank you!