Slides - Text Summarization

Recent Ideas in Summarization
Arzucan Ozgur
SI/EECS 767
March 12, 2010
Introduction

The goal of summarization is to take an information
source, extract content from it, and present the
most important content in a condensed form.

Types of summarization


Extractive vs. Abstractive

Single-document vs. Multi-document

Generic vs. Query-oriented
Focus on extractive multi-document summarization.
Summarization
Multi-document Summarization
[Figure: document collection D mapped to a summary S]
A document collection D = {D1, ..., Dn} describing the same or a closely related set of events.
Extractive Multi-document Summarization
[Figure: document collection D mapped to an extractive summary S]
Summary S consists of sentences in D, totaling at most L words.
A document collection D = {D1, ..., Dn} describing the same or a closely related set of events.
Approaches



- Statistical models (e.g., LDA)
  - (Arora & Ravindran, 2008)
  - (Haghighi & Vanderwende, 2009)
- Graph-based models
  - (Wan, 2008)
  - (Wan & Yang, 2008)
- Utilizing linguistic and semantic information
  - (Hachey, 2009)
- Utilizing external resources: Wikipedia & WordNet
  - (Nastase, 2008)
Exploring content models for multi-document summarization
(Haghighi & Vanderwende, 2009)
Introduction

Explore generative probabilistic models for multidocument summarization.

Begin with a simple word frequency-based model.

Construct a sequence of models, each injecting more structure into the representation of the document set content.
SumBasic [Nenkova & Vanderwende, 2006]

A simple unigram distribution over words is used to score sentences.
Example: the sentence "The Phantom Menace was a major financial success." has content words with unigram probabilities 0.15, 0.12, 0.01, 0.05, and 0.04, so Score(S) = (0.15 + 0.12 + 0.01 + 0.05 + 0.04) / 5 = 0.074, the average probability of these words.
p(w): initially reflects the observed unigram probabilities obtained from the document collection.
SumBasic: Algorithm
Greedily pick the highest-scoring sentence; after each pick, update p(w) <- p(w)^2 for every word w in the chosen sentence to discourage redundancy (a code sketch follows below).
- Based on the raw empirical unigram distribution to represent content significance.
- No distinction between words that occur many times in a single document vs. the same number of times across several documents.
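A minimal Python sketch of this loop (sentence tokenization, the word limit, and treating all words uniformly are simplifying assumptions; details of the original system may differ):

from collections import Counter

def sumbasic(sentences, max_words=100):
    """Greedy SumBasic summarizer: sentences is a list of token lists."""
    # Initial word probabilities from the empirical unigram distribution.
    counts = Counter(w for s in sentences for w in s)
    total = sum(counts.values())
    p = {w: c / total for w, c in counts.items()}

    summary, length = [], 0
    candidates = list(sentences)
    while candidates and length < max_words:
        # Score each sentence by the average probability of its words.
        best = max(candidates, key=lambda s: sum(p[w] for w in s) / len(s))
        summary.append(best)
        length += len(best)
        candidates.remove(best)
        # Square the probability of every word in the chosen sentence
        # to discourage redundancy in later picks.
        for w in set(best):
            p[w] = p[w] ** 2
    return summary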
KLSum
Example: if the document collection distribution is {phantom: 0.15, menace: 0.12, financial: 0.05}, a candidate summary with distribution {phantom: 0.16, menace: 0.11, financial: 0.05} is preferred over one with {phantom: 0.20, menace: 0.09, financial: 0.01}, because it is closer to the collection distribution.
KLSum
KL(P||Q): Kullback-Leibler divergence, KL(P||Q) = Σ_w P(w) · log( P(w) / Q(w) ), between the true distribution P (here: the document set unigram distribution P_D) and the approximating distribution Q (the summary distribution P_S).
KLSum: find the set of summary sentences S*, of at most L words, whose empirical unigram distribution most closely matches the document set distribution:
S* = argmin over candidate summaries S of KL(P_D || P_S)
P_S: empirical unigram distribution of the candidate summary S.
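This objective is usually approximated greedily: at each step add the sentence that most reduces KL(P_D || P_S). A rough Python sketch under that assumption (the smoothing constant and the greedy search are illustrative choices, not details from the paper):

import math
from collections import Counter

def kl_divergence(p, q, vocab, eps=1e-6):
    # KL(P || Q) = sum_w P(w) * log(P(w) / Q(w)), with smoothing for unseen words.
    return sum(p.get(w, 0) * math.log(p.get(w, eps) / q.get(w, eps))
               for w in vocab if p.get(w, 0) > 0)

def unigram_dist(token_lists):
    counts = Counter(w for toks in token_lists for w in toks)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def klsum(sentences, max_words=100):
    """Greedily pick sentences whose distribution stays close to the collection's."""
    p_d = unigram_dist(sentences)          # document-set distribution
    vocab = set(p_d)
    summary, candidates = [], list(sentences)
    while candidates and sum(len(s) for s in summary) < max_words:
        best = min(candidates,
                   key=lambda s: kl_divergence(p_d, unigram_dist(summary + [s]), vocab))
        summary.append(best)
        candidates.remove(best)
    return summary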
TopicSum

Raw unigram distribution may not best reflect the
content of the document collection for summary
extraction.

LDA (Latent Dirichlet Allocation) Topic Model
Topic Models

A generative probabilistic model (for text corpora).

Documents are mixtures of topics.

Topic is a probability distribution over words.

Procedure by which documents can be generated.
Generative Models
Latent Dirichlet Allocation
[Blei et al., 2003]
The probability density function returns the belief that the probabilities of K rival
events are ϑi given that each event has been observed αi − 1 times
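For reference, the Dirichlet density being described is (standard form, with ϑ on the probability simplex):

p(\vartheta_1, \dots, \vartheta_K \mid \alpha_1, \dots, \alpha_K)
  = \frac{\Gamma\!\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)}
    \prod_{i=1}^{K} \vartheta_i^{\alpha_i - 1},
\qquad \vartheta_i \ge 0,\ \sum_{i=1}^{K} \vartheta_i = 1.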
Graphical Model Representation of LDA
- α: the parameter of the uniform Dirichlet prior on the per-document topic distributions
- β: the parameter of the uniform Dirichlet prior on the per-topic word distribution
- θ_i: topic distribution for document i
- z_ij: the topic for the jth word in document i
- w_ij: the specific word (the only observable variables)
Inference
Exact inference is computationally intractable; approximation methods (e.g., variational inference, Gibbs sampling) are used.
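A toy numpy sketch of the generative procedure described above (corpus sizes and hyperparameter values are arbitrary; this only simulates generation and does not perform the intractable posterior inference):

import numpy as np

rng = np.random.default_rng(0)
K, V, n_docs, doc_len = 3, 50, 5, 20   # topics, vocabulary size, documents, words per doc
alpha, beta = 0.5, 0.01                # symmetric Dirichlet hyperparameters

# Per-topic word distributions (phi_k ~ Dirichlet(beta)).
phi = rng.dirichlet(np.full(V, beta), size=K)

corpus = []
for _ in range(n_docs):
    theta = rng.dirichlet(np.full(K, alpha))      # topic distribution for this document
    doc = []
    for _ in range(doc_len):
        z = rng.choice(K, p=theta)                # draw topic z_ij
        w = rng.choice(V, p=phi[z])               # draw word w_ij from topic z
        doc.append(w)
    corpus.append(doc)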
Example
TopicSum Topic Model
Background distribution: shared across document collections; models stop words that do not contribute content.
e.g., the: 0.20, an: 0.15, ..., financial: 0.03, phantom: 0.01
TopicSum Topic Model
Content distribution: represents the significant content of the document collection D.
e.g., star: 0.20, wars: 0.15, phantom: 0.12, financial: 0.09
TopicSum Topic Model
Document-specific distribution: represents the words which are local to a single document in D but do not appear across several documents.
e.g., award: 0.20, nomination: 0.15, won: 0.12, director: 0.09 (one document); financial: 0.12, million: 0.09, $: 0.05 (another document)
TopicSum Topic Model
For each sentence, draw a distribution over topics from a Dirichlet prior with pseudocounts: Content (1), DocSpecific (5), Background (10).
For each word position: draw a topic Z, then draw the word W from the topic that Z indicates.
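A toy sketch of this per-sentence draw, assuming the pseudocounts shown on the slide and made-up topic vocabularies (the full TopicSum model additionally ties content distributions to collections and document-specific distributions to documents):

import numpy as np

rng = np.random.default_rng(0)

# Word distributions for the three topic types (toy vocabularies; assumptions).
topics = {
    "CONTENT":     {"phantom": 0.4, "menace": 0.3, "financial": 0.3},
    "DOCSPECIFIC": {"award": 0.5, "nomination": 0.5},
    "BACKGROUND":  {"the": 0.6, "an": 0.4},
}
pseudocounts = {"CONTENT": 1.0, "DOCSPECIFIC": 5.0, "BACKGROUND": 10.0}

def generate_sentence(n_words=6):
    names = list(pseudocounts)
    # Per-sentence distribution over the three topics, drawn from the Dirichlet prior.
    psi = rng.dirichlet([pseudocounts[t] for t in names])
    words = []
    for _ in range(n_words):
        z = names[rng.choice(len(names), p=psi)]               # draw topic Z
        vocab, probs = zip(*topics[z].items())
        words.append(vocab[rng.choice(len(vocab), p=probs)])   # draw word W from topic Z
    return words

print(generate_sentence())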
TopicSum

Summary sentences extracted using KLSum by
plugging in a learned content distribution in place
of the raw unigram distribution.
Learned content distribution, e.g.: phantom: 0.15, menace: 0.12, financial: 0.05
HierSum Topic Model
Structured Content Models:
• General content of the document collection.
• Several sub-stories with their own specific vocabulary.
HierSum Topic Model
For each content word: is it a general or a specific content word? If specific, which sub-topic does it belong to?
Results
Example Summarization Output
Manual User Evaluation
Topical Summarization
Latent Dirichlet allocation and singular value decomposition based multi-document summarization.
(Arora & Ravindran, 2008)
Introduction

- Similar to HierSum's structured content idea.
- A set D of related documents shares a common theme (central topic).
- Documents in D have other sub-topics, which may be common to some of the documents and which support or give detail about the central theme.
- Approach:
  - Use LDA to find the different topics in the documents.
    - Documents are represented as a mixture of topics.
    - Topics are represented as a mixture of words.
  - Use SVD (Singular Value Decomposition) to find sentences that best represent these topics.
LDA

Each document Dk is a mixture model over the
topics Tj -> P(Tj|Dk)

Each topic Tj is a mixture model over the words Wi
of the vocabulary -> P(Wi|Tj)

Topics are independent of each other.

Assume a sentence represents one topic:

K different topics -> K independent representations of
each sentence.
LDA
Singular Value Decomposition
- SVD is applied to the term-by-sentence matrix, weighted by the LDA mixture model, to reduce redundancy in summary creation.
- The entry with the highest value in V corresponds to the original vector with the greatest impact in this orthogonal direction.
- Find the vectors which have the greatest impact in different orthogonal directions (see the sketch below).
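A rough numpy sketch of the SVD step under assumptions: entries of the term-by-sentence matrix are taken to be LDA-derived weights, and for each leading right-singular direction the sentence with the largest entry of V (in absolute value) is picked; the paper's exact weighting and selection strategy may differ.

import numpy as np

def pick_sentences(term_sentence_matrix, n_pick):
    """term_sentence_matrix: terms x sentences, entries weighted by the LDA mixture."""
    # Full SVD: A = U * diag(s) * Vt; rows of Vt are orthogonal directions over sentences.
    U, s, Vt = np.linalg.svd(term_sentence_matrix, full_matrices=False)
    chosen = []
    for direction in Vt[:n_pick]:
        # Sentence with the greatest impact in this orthogonal direction.
        idx = int(np.argmax(np.abs(direction)))
        if idx not in chosen:
            chosen.append(idx)
    return chosen

# Toy example: 5 terms x 4 sentences with arbitrary LDA-derived weights.
A = np.random.default_rng(0).random((5, 4))
print(pick_sentences(A, n_pick=2))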
Results
Multi-document summarization using
cluster-based link analysis
(Wan & Yang, 2008)
Graph-based multi-document summarization
Edge weights: cosine similarity between sentences.
A Markov Random Walk Model is used to identify the salient sentences (e.g., LexRank [Erkan & Radev, 2004]).
(Discussed in the Random Walks talk by Ahmed Hassan.)
Transition probability: p(vi -> vj) = w(vi, vj) / Σ_k w(vi, vk)
Sentence saliency: score(vi) = μ · Σ_j score(vj) · p(vj -> vi) + (1 - μ) / |V|
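A compact power-iteration sketch of this saliency computation (the damping factor and convergence threshold are conventional choices, not values from the slides):

import numpy as np

def saliency_scores(sim, mu=0.85, tol=1e-6, max_iter=200):
    """sim: symmetric sentence-similarity matrix (e.g., cosine similarities)."""
    sim = np.array(sim, dtype=float)
    n = sim.shape[0]
    np.fill_diagonal(sim, 0.0)
    # Row-normalize similarities into transition probabilities (uniform row if all zeros).
    row_sums = sim.sum(axis=1, keepdims=True)
    P = np.divide(sim, row_sums, out=np.full_like(sim, 1.0 / n), where=row_sums > 0)
    score = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new = mu * P.T @ score + (1 - mu) / n
        if np.abs(new - score).sum() < tol:
            break
        score = new
    return score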
Cluster-based Link Analysis

Basic graph-based model: only sentence-level
information.

A document set consists of a number of themes
(sub-topics).

Each theme represented by a cluster of related
sentences.


Not all theme clusters are equally important.

Not all sentences in a theme cluster are equally
important.
Goal

Incorporate cluster-level information and sentence-to-cluster relationships into the graph-based model.
System Overview
Theme Cluster Detection
(K-means, agglomerative, divisive)
Sentence Score Computation
Summary Extraction
Cluster-based Conditional Markov Random Walk Model
The transition probabilities / edge weights are conditioned on the theme clusters using two factors (a sketch follows below):
- Importance of the cluster: cosine similarity between the cluster and the document set.
- Correlation between the sentence vi and the cluster: cosine similarity between the sentence and the cluster.
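One plausible way to fold these two cluster factors into the edge weights before running the same random walk as above; the slide does not give the exact combination, so this is only an illustrative, assumption-laden sketch:

import numpy as np

def cluster_conditioned_weights(sim, cluster_of, cluster_importance, sent_cluster_corr):
    """sim[i, j]: sentence-sentence cosine similarity.
    cluster_of[j]: index of the theme cluster containing sentence j.
    cluster_importance[c]: cosine similarity of cluster c to the document set.
    sent_cluster_corr[j]: cosine similarity of sentence j to its cluster."""
    n = sim.shape[0]
    W = np.array(sim, dtype=float)
    for j in range(n):
        # Boost transitions into sentences that belong to important clusters and
        # that are central within their own cluster.
        W[:, j] *= cluster_importance[cluster_of[j]] * sent_cluster_corr[j]
    return W  # row-normalize and run the random walk as in the previous sketch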
Cluster-based HITS Model
Bipartite graph between theme clusters (hubs) and sentences (authorities).
wij: cosine similarity between the sentence and the cluster.
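A minimal HITS iteration over the sentence-cluster bipartite graph (sentences as authorities, clusters as hubs; the normalization is a conventional choice, not a detail from the slides):

import numpy as np

def cluster_hits(W, n_iter=50):
    """W[i, c]: cosine similarity between sentence i and theme cluster c."""
    n_sent, n_clus = W.shape
    auth = np.ones(n_sent)   # sentence authority scores
    hub = np.ones(n_clus)    # cluster hub scores
    for _ in range(n_iter):
        auth = W @ hub        # a sentence is salient if linked to good hub clusters
        hub = W.T @ auth      # a cluster is a good hub if linked to salient sentences
        auth /= np.linalg.norm(auth) or 1.0
        hub /= np.linalg.norm(hub) or 1.0
    return auth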
Results
Influence of Combination Weight (λ)
Influence of number of clusters
- The Conditional Markov Random Walk based model is more robust than the HITS-based model.
- CMRW uses both sentence-to-sentence and sentence-to-cluster relationships.
- HITS uses only sentence-to-cluster relationships -> strongly influenced by the detected theme clusters.
An exploration of document impact on
graph-based multi-document
summarization
(Wan, 2008)
Document-based Graph Model
Edge weights are conditioned on the documents using two factors:
- Importance of the document in the document set.
- Correlation between the sentence vi and its document.
Document Importance
Measured in one of several ways:
- The cosine similarity between the document and the whole document set.
- The average similarity between the document and every other document in the document set.
- A weighted similarity graph between documents, with PageRank used to compute the rank scores of the documents (see the sketch below).
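The PageRank-based measure can be computed directly on the weighted document-similarity graph; a short sketch using networkx as one concrete option (treating any positive similarity as an edge is an assumption):

import networkx as nx

def document_importance(doc_sim):
    """doc_sim[i][j]: cosine similarity between documents i and j."""
    n = len(doc_sim)
    G = nx.Graph()
    G.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if doc_sim[i][j] > 0:
                G.add_edge(i, j, weight=doc_sim[i][j])
    # PageRank on the weighted similarity graph gives document rank scores.
    return nx.pagerank(G, weight="weight")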
Sentence-Document Correlation
- Based on the position of the sentence in the document.
- Based on the cosine similarity between the sentence and the document.
Results
Document-level analysis performs better than cluster-level analysis.
Influence of Combination Weight (λ)
Multi-document summarisation using
generic relation extraction
(Hachey, 2009)
Introduction

Goal: Investigate the effect of various sentence
representation models on the accuracy of multi-document summarization.

Motivation: Capture deeper semantic information
using sentence representations based on
Information Extraction.

Propose a sentence representation based on
Generic Relation Extraction (GRE).
Summarization Framework
Sentence Representation Models
Results
System combination can improve results
Topic-driven multi-document
summarization with encyclopedic
knowledge and spreading activation.
(Nastase, 2008)
Introduction


The goal of topic-driven summarization is to derive from a set of documents a summary that contains information on a specific topic of interest to the user.

Understanding the user's information request.

Understanding the documents to be summarized.

Understanding requires lexical, common-sense, and
encyclopedic knowledge.
Approach:

Topic expansion using Wikipedia and WordNet

Topic expansion with Spreading Activation and PageRank
Sample topics from DUC 2007
Extract Related Concepts from Wikipedia
Topic Expansion using Wikipedia and WordNet
Topic Expansion with Spreading Activation and PageRank
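A rough sketch of spreading activation over a concept graph with signal decay (the decay factor, propagation rule, and stopping depth are assumptions; the paper's exact formulation may differ):

def spread_activation(graph, seeds, decay=0.5, max_depth=3):
    """graph: dict mapping a concept to its related concepts (e.g., from Wikipedia links).
    seeds: topic concepts, each starting with activation 1.0."""
    activation = {c: 1.0 for c in seeds}
    frontier = dict(activation)
    for _ in range(max_depth):
        next_frontier = {}
        for concept, energy in frontier.items():
            for neighbor in graph.get(concept, []):
                passed = energy * decay        # signal decays at each hop
                if passed > activation.get(neighbor, 0.0):
                    activation[neighbor] = passed
                    next_frontier[neighbor] = passed
        frontier = next_frontier
        if not frontier:
            break
    return activation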
Results
Impact of Signal Decay in Spreading Activation
Summary and Discussion

- Extractive multi-document summarization.
- Different approaches
  - statistical
  - graph-based
  - linguistically motivated
  - external resources (WordNet and Wikipedia)
- Performance not directly comparable.
- Statistical vs. deeper understanding of text?
References

Rachit Arora and Balaraman Ravindran. Latent Dirichlet allocation and singular value decomposition
based multi-document summarization. In ICDM'08: Proceedings of the 2008 Eighth IEEE International
Conference on Data Mining, pages 713-718, Washington, DC, USA, 2008.

Aria Haghighi and Lucy Vanderwende. Exploring content models for multi-document summarization. In
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American
Chapter of the Association for Computational Linguistics, pages 362-370, Boulder, Colorado, June
2009. Association for Computational Linguistics.

Xiaojun Wan and Jianwu Yang. Multi-document summarization using cluster-based link analysis. In
SIGIR'08: Proceedings of the 31st annual international ACM SIGIR conference on Research and
development in information retrieval, pages 299-306, New York, NY, USA, 2008.

Xiaojun Wan. An exploration of document impact on graph-based multi-document summarization. In
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages
755-762, Honolulu, Hawaii, October 2008. Association for Computational Linguistics.

Ben Hachey. Multi-document summarisation using generic relation extraction. In Proceedings of the
2009 Conference on Empirical Methods in Natural Language Processing, pages 420-429, Singapore, 6-7 August 2009.

Vivi Nastase. Topic-driven multi-document summarization with encyclopedic knowledge and spreading
activation. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language
Processing, pages 763-772, Honolulu, Hawaii, October 2008. Association for Computational
Linguistics.

Mark Steyvers and Tom Griffiths. Probabilistic Topic Models. Lawrence Erlbaum Associates, 2007.

D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research,
3:993–1022, January 2003.
Thank you!