
NLP for Microblog
Summarization
KAM-FAI WONG
THE CHINESE UNIVERSITY OF HONG KONG
Outline
Part-I Introduction (Hypotheses)
Part-II Microblog Summarization
Part-III Coarse Grain M-S
Part-IV Open Questions
2
Part-I
Introduction
(Hypotheses)
3
World Facts
World Population = 7.21 Billion
World Internet penetration = 42% (3B)
World SM penetration = 29% (2.1B)
Mobile Subscription = ~100% (7.09B)
http://wearesocial.net/blog/2015/01/digital-social-mobileworldwide-2015
4
China Facts
Population = 1.37 Billion
Internet penetration = 47% (0.642B)
Active SM Account = 46% (0.629B)
Mobile subscription = 95% (1.3B)
Top SM Activities = WeChat (30%), Sina
Weibo (25%), Tencent Weibo (21%), Youku
(19%), Google+ (10%), Facebook (9%)
5
Microblogging
•Microblog platforms: WeChat, Twitter, etc.
•Usage: sharing (e.g., check-ins, 打卡), event reporting,
discussion, information dissemination (e.g.,
real-life issues such as the missing MH370,
iPhone 6s ads, etc.)
•Microblog processing is useful for event
analysis, e.g., for e-commerce, opinion
mining, etc.
6
李晨:我們
LI Chen: We
7
Original Microblog
Reposts
8
范冰冰:我們
FAN Bingbing: We
李晨:我們
LI Chen: We
馮紹峰:恭喜晨和冰冰
FENG Shaofeng:
Congrats
to Chen and Bingbing
用户5***6:幸福,在一起
User5***6: Sweet love
9
Microblog Repost Tree
Repost Tree = Structure + Messages
Structure = information diffusion pattern,
microblogger relationship, context, etc.
Messages = short text (limited number of
words and lack of context)
Semantically, a repost tree organizes
fragmented text into a cohesive body
10
Hypothesis 1
Microblog text is a form of
Natural Language document.
11
[O] MAS: Malaysia Airlines has lost contact of MH17 from Amsterdam.
The last known position was over Ukrainian airspace. More details to
follow.
[R1] Hanna: OMG…Poor on
#MH17…Preying…
[R6]Taylor: Najib Razak reported that an MH
plain has crashed… I suggest MAS launch an
investigation immediately to identify the
crashed plain.
[R2]Victoria: OMG that’s horrible!!!
I'm sorry to hear that. God will all
bless u poor guys. Wish world can be
peaceful. And no one will get hurt.
[R3] Dr.Dr: Six top HIV scientists
are on MH17. They go for AIDS and
would NEVER come back!!!
[R4] TomyBlack: 6
experts died?!
Terrible loss to HIV
research :(
[R5] JustinBieber:
now i can’t listen
to #prey without
crying
[R7]MrsBig: RT
[R8] MrBig: Agree. We
should confirm
whether #MH17 has
crashed.
[R9] WindWolf: u
know, the crash may
due to the war in
Ukraine.
[R10] X-Man: MH370 has not been found, and
now MH17’s lost, here’s something
suspicious. How u guys think about this?
12
[O] MAS: Malaysia Airlines has lost contact of MH17 from Amsterdam.
The last known position was over Ukrainian airspace. More details to
follow.
[R1] Hanna: OMG…Poor on
#MH17…Preying…
[R6]Taylor: Najib Razak reported that an MH
plain has crashed… I suggest MAS launch an
investigation immediately to identify the
crashed plain.
[R2]Victoria: OMG that’s horrible!!!
I'm sorry to hear that. God will all
bless u poor guys. Wish world can be
peaceful. And no one will get hurt.
[R3] Dr.Dr: Six top HIV scientists
are on MH17. They go for AIDS and
would NEVER come back!!!
[R4] TomyBlack: 6
experts died?!
Terrible loss to HIV
research :(
[R5] JustinBieber:
now i can’t listen
to #prey without
crying
Sentence
(m-sen)
[R7]MrsBig: RT
[R8] MrBig: Agree. We
should confirm
whether #MH17 has
crashed.
[R9] WindWolf: u
know, the crash may
due to the war in
Ukraine.
[R10] X-Man: MH370 has not been found,
and now MH17’s lost, here’s something
suspicious. How u guys think about this?
Paragraph (m-par)
Document (m-doc)
15
Microblog as a Document
M-Document
◦ Microblog repost tree
M-Paragraph
◦ Message cluster focusing on the same
topic
M-Sentence
◦ A message on a repost tree
16
Hypothesis 2
Natural Language Processing (NLP)
techniques are applicable to microblogs
MICROBLOG SUMMARIZATION
17
Part II
Microblog
Summarization
18
Summarization
The goal of text summarization is to
automatically produce a succinct summary
of one or more documents that preserves
important information (Radev et al. 2002)
Two approaches: abstractive and extractive summarization.
19
NLP for Summarization
Discourse processing
◦ Document as a sequence of connected sentences
Traditional coherence relations (Mann et al. 1988,
Stolcke et al. 2000)
◦ Semantic:
contrast, elaboration, cause, purpose, etc.
◦ Pragmatic:
speech acts (question, statement, response,
etc.)
Can conventional NLP techniques be used?
20
李晨:我們
LI Chen: We
21
1,055,554
22
Objective
Microblog summarization
◦ To identify salient messages and generate a
succinct summary that conserves important
information
23
Difficulties
Chang et al. (2013) showed that
conventional extractive summarization
models (e.g., LexRank, MEAD, tf-idf,
Integer Linear Programming) are ineffective
on microblogs, because microblog text is poor in quality:
◦ Short and noisy messages
◦ Lack of grammatical structure and context
24
Existing Work
Clustering:
Event-based (Chakrabarti et al. 2011;
Duan et al. 2012; Shen et al. 2013)
Topic-based (Long et al. 2011; Rosa et
al. 2011; Meng et al. 2012)
25
Existing Work
Solution: make use of social signals, e.g.,
user influence and message
popularity.
Problem: these signals do not necessarily
indicate salient messages, e.g.,
celebrities can post a popular message
with no important content
26
Existing Work
Chang et al. (2013) investigated Twitter
summarization:
◦ Input: tweet stream (not tree)
◦ Salient message extraction: user influence
based on user interaction (not content-based)
◦ Method: supervised (needs manual labeling)
27
Hypothesis 3
Some microbloggers (i.e., leaders) are
more influential than others (i.e.,
followers).
Coarse grain microblog summarization
based on leaders-followers
28
Our Approach
Coarse grain microblog summarization
◦ Input: microblog repost tree
◦ Salient message extraction: (1) content
similarity of repost messages (sentence
level) + (2) context coherence based on
repost tree structure (discourse level)
◦ Method: unsupervised
29
Part III
Coarse Grain
Microblog
Summarization
30
Preamble
Jing Li, Wei Gao, Zhongyu Wei, Baolin
Peng and Kam-Fai Wong, “Using
Content-level Structures for
Summarizing Microblog Repost Trees”,
EMNLP 2015, Lisbon, Portugal,
September 17-21, 2015, pp. 2168-2178.
31
Microblog Repost Tree
T = (V, E)
Nodes (V):
◦ All reposts to an original microblog post
Root (vo):
◦ The original microblog post
Edges (E):
◦ Reposting relations
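A minimal sketch of T = (V, E), assuming each message stores the id of the message it reposts. The class name, field names, and the parent links in the MH17 fragment are illustrative assumptions, not taken from the talk. Storing only parent pointers is enough to recover the path from any repost up to the root:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Message:
    """One node of the repost tree (hypothetical field names)."""
    msg_id: str
    author: str
    text: str
    parent_id: Optional[str] = None  # None marks the root (original post)


def path_to_root(messages: Dict[str, Message], msg_id: str) -> List[str]:
    """Follow reposting edges from a message up to the root."""
    path = []
    current: Optional[str] = msg_id
    while current is not None:
        path.append(current)
        current = messages[current].parent_id
    return path


# Fragment of the MH17 example; parent links are assumed for illustration.
tree = {
    "O": Message("O", "MAS", "Malaysia Airlines has lost contact of MH17"),
    "R3": Message("R3", "Dr.Dr", "Six top HIV scientists are on MH17", "O"),
    "R4": Message("R4", "TomyBlack", "6 experts died?!", "R3"),
}
```

Reversing the returned list gives a root-first reposting path, which is the unit on which a sequence model can label messages.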
32
[O] MAS: Malaysia Airlines has lost contact of MH17 from Amsterdam.
The last known position was over Ukrainian airspace. More details to
follow.
[R1] Hanna: OMG…Poor on
#MH17…Preying…
[R6]Taylor: Najib Razak reported that an MH
plain has crashed… I suggest MAS launch an
investigation immediately to identify the
crashed plain.
[R2]Victoria: OMG that’s horrible!!!
I'm sorry to hear that. God will all
bless u poor guys. Wish world can be
peaceful. And no one will get hurt.
[R3] Dr.Dr: Six top HIV scientists
are on MH17. They go for AIDS and
would NEVER come back!!!
[R4] TomyBlack: 6
experts died?!
Terrible loss to HIV
research :(
[R5] JustinBieber:
now i can’t listen
to #prey without
crying
[R7]MrsBig: RT
[R8] MrBig: Agree. We
should confirm
whether #MH17 has
crashed.
[R9] WindWolf: u
know, the crash may
due to the war in
Ukraine.
[R10] X-Man: MH370 has not been found, and
now MH17’s lost, here’s something
suspicious. How u guys think about this?
33
Statement: MAS has lost contact of MH17.
Suggestion (Respond): A crashed plain found. I suggest MAS launch an immediate investigation.
Repeat: RT
Support: Agree…
Background: The crash may due to Ukrainian war.
Statement & Question: MH370 has not been found and MH17 is lost. There’s something suspicious. How u guys think about it?
35
Discourse in Microblog
Traditional coherence relations (Mann et
al. 1988, Stolcke et al. 2000)
◦ Semantic: contrast, elaboration, cause,
purpose, etc.
◦ Pragmatic: speech acts (question, statement,
response, etc.)
Model coherence relations on repost tree
◦ Coarse-grained – leaders & followers
36
Root: MAS has lost contact of MH17. (Leader)
Respond: OMG…Poor on #MH17… (Follower)
Respond: OMG horrible!!! Wish world can be peaceful. (Follower)
New info: Six top HIV scientists are on MH17. (Leader)
Respond: Experts died?! Terrible loss. (Follower)
38
Coarse Grain Summarizer
Two Steps:
(1) Leader Detection (CRF) +
(2) Summarization (LeadSum)
39
Step 1:
Leader detection model
Figure: label sequence along a repost path: O, F, F, L, F
40
Features for leader detection
Feature category: feature description
◦ Text-based: type of sentence of m_i (question or exclamatory)
◦ Microblog-specific
◦ Path-specific: cosine similarity between m_i and its neighbors; cosine similarity between m_i and the root microblog
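The similarity and sentence-type features can be sketched as follows. This is an illustration only: the talk does not state the term weighting or tokenization, so raw term counts over whitespace tokens are assumed, and the function and feature names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity of two messages under a raw term-count model."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def path_features(path: List[str], i: int) -> dict:
    """Features for message m_i on a root-first repost path."""
    m_i = path[i]
    return {
        # Similarity to the root microblog.
        "sim_root": cosine_similarity(m_i, path[0]),
        # Similarity to the neighbors on the path (0.0 when absent).
        "sim_parent": cosine_similarity(m_i, path[i - 1]) if i > 0 else 0.0,
        "sim_child": (
            cosine_similarity(m_i, path[i + 1]) if i + 1 < len(path) else 0.0
        ),
        # Crude sentence-type cues based on final punctuation.
        "is_question": m_i.rstrip().endswith("?"),
        "is_exclamatory": m_i.rstrip().endswith("!"),
    }
```

Chinese microblog text would need word or character segmentation before counting terms; the whitespace split here only suits the English examples.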
41
Step 2: Summarization
Basic-LeadSum model
Figure: the repost tree is reduced to only leaders, which are linked by similarity (sim) edges and ranked by a random walk
Transition probabilities based on DivRank (Mei et al.
2010):
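As a rough illustration of a DivRank-style walk over the leaders' similarity graph, the sketch below uses a pointwise approximation in the spirit of Mei et al. (2010): each transition is reinforced by the current score of its target, so high-scoring nodes absorb weight from similar neighbors. The function name, the parameters `lam` and `alpha`, and the uniform jump distribution are assumptions; this is not claimed to match the exact formula used in the talk.

```python
from typing import List


def divrank(weights: List[List[float]], lam: float = 0.9,
            alpha: float = 0.25, iters: int = 100) -> List[float]:
    """Rank nodes with a reinforced random walk (DivRank-style sketch).

    weights[u][v] is a non-negative similarity between messages u and v.
    """
    n = len(weights)
    # Organic transitions p0: stay put with probability 1 - alpha,
    # otherwise move to a neighbor in proportion to edge weight.
    p0 = []
    for u in range(n):
        out = sum(weights[u][v] for v in range(n) if v != u)
        row = [alpha * weights[u][v] / out if (v != u and out > 0) else 0.0
               for v in range(n)]
        row[u] = 1.0 - sum(row)
        p0.append(row)

    prior = [1.0 / n] * n  # uniform jump distribution (assumption)
    score = [1.0 / n] * n
    for _ in range(iters):
        nxt = [0.0] * n
        for u in range(n):
            # Reinforce transitions toward nodes that already score highly.
            denom = sum(p0[u][v] * score[v] for v in range(n))
            for v in range(n):
                p_uv = (1 - lam) * prior[v] + lam * p0[u][v] * score[v] / denom
                nxt[v] += score[u] * p_uv
        score = nxt
    return score
```

The top-scoring messages would then be taken as summary candidates; in the Basic-LeadSum setting the input graph would contain only detected leaders.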
42
Potential Problems of the
Basic-LeadSum model
Error propagation from leader detection
model
◦ Leaders misclassified as followers (false
negatives): strong summary candidates are left out
◦ Followers misidentified as leaders (false
positives): true followers may be extracted into the
summary
To reduce errors cascaded from the leader
detection module
◦ Enhance Basic-LeadSum to Soft-LeadSum
43
All messages participate in the ranking process
Figure: repost tree with similarity (sim) edges
WALK-2 flowchart: Leader? (Yes / No); go to parent; sample from the leader probability of the current vertex
44
Soft-LeadSum model
WALK-1
Transition probability defined piecewise, with cases: if u = v; if v is u’s ancestor; otherwise
45
Experimental setup for leader
detection
Data: 1300 reposting paths
◦ 1300 original microblogs + 4772 reposts
◦ 1000 paths for training and 300 for testing
3 annotators labeled leaders/followers
given repost tree paths
◦ use labels agreed on by at least 2 annotators
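The agreement rule can be sketched as a simple majority vote. The function name and the label strings are hypothetical; the talk only states that labels agreed on by at least 2 of the 3 annotators were used.

```python
from collections import Counter
from typing import List, Optional


def majority_label(labels: List[str], min_agree: int = 2) -> Optional[str]:
    """Return the label chosen by at least `min_agree` annotators, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agree else None
```

With three annotators and two classes, some label always reaches the threshold, so every message receives a gold label under this rule.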
46
Performance of leader detection models
          Cross-validation          Held-out
Model     Prec    Recall  F1        Prec    Recall  F1
Random    .298    .495    .373      .316    .496    .386
LR        .705    .663    .684      .704    .662    .682
SVM       .709    .669    .688      .689    .662    .675
SVMhmm    .748    .655    .698      .693    .701    .697
CRF       .755    .720    .737      .711    .707    .709
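The F1 column is the harmonic mean of precision and recall. A short sketch (the function name is illustrative) that reproduces the CRF figures from the reported precision and recall:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


crf_cv = round(f1_score(0.755, 0.720), 3)        # cross-validation row
crf_held_out = round(f1_score(0.711, 0.707), 3)  # held-out row
```

Because the reported precision and recall are themselves rounded, recomputed values for other rows can differ from the reported F1 in the third decimal place.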
47
Data collection for summarization
Name          # of nodes   # of nodes with comments   Height   Category
Tree (I)      21,353       15,409                     16       Social news
Tree (II)     9,616        6,073                      11       Social news
Tree (III)    13,087       9,583                      8        Movie
Tree (IV)     12,865       7,083                      8        Music
Tree (V)      10,666       7,129                      8        Entertainment news
Tree (VI)     21,127       15,057                     11       Sports news
Tree (VII)    18,974       12,399                     13       Social news
Tree (VIII)   2,021        925                        18       Political news
Tree (IX)     9,230        5,408                      14       Breaking events
Tree (X)      10,052       4,257                      25       Breaking events
48
Performance of summarization models
                ROUGE-1                 ROUGE-2
Model           F1      σ       SIG     F1      σ       SIG
RandSum         .159    .046    **‡     .037    .009    **‡
RepSum          .162    .071    **‡     .030    .016    **‡
UserRankSum     .292    .066    ‡       .087    .028    †
LeadProSum      .270    .119    ‡       .064    .038    ‡
SVDSum          .222    .070    **‡     .048    .032    **‡
DivRankSum      .159    .079    **‡     .029    .018    **‡
UserInfSum      .272    .091    ‡       .071    .028    ‡
B-LS+SVMhmm     .301    .031    ‡       .085    .020    †
B-LS+CRF        .300    .029    ‡       .082    .016    ‡
S-LS+CRF        .351    .027    NA      .105    .018    NA
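For readers unfamiliar with the metric, ROUGE-N F1 can be sketched as clipped n-gram overlap between a candidate summary and a reference. This is a simplified single-reference version with whitespace tokenization, written for illustration; it is not the evaluation script behind the reported numbers, and the function names are hypothetical.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate: str, reference: str, n: int = 1) -> float:
    """F1 of clipped n-gram overlap between candidate and reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Setting `n=1` corresponds to ROUGE-1 and `n=2` to ROUGE-2.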
49
Conclusions
Contribution:
Propose a novel framework to summarize repost trees
using coarse-grained discourse structure on the microblog repost
tree. (Corollary: NLP techniques can be used in microblog
summarization.)
Achievements:
Introduce the leader/follower concept to reduce noise on
repost trees
Propose a CRF-based leader detection model using
microblog content and context information.
Incorporate leader detection results into an effective
summarization model based on random walk
50
Part IV
Open Questions
51
Several open questions
Are there other effective features of
microblog repost trees that help microblog
summarization?
◦ Locations in repost tree?
◦ Posting Time?
Can discourse in repost trees help other
NLP applications?
◦ Sentiment analysis?
◦ Reasoning and comprehension?
52
Reference
(Lafferty et al. 2001) John D. Lafferty, Andrew McCallum,
Fernando C. N. Pereira:
Conditional Random Fields: Probabilistic Models for
Segmenting and Labeling Sequence Data. ICML 2001,
282-289
(Li et al. 2015) Jing Li, Wei Gao, Zhongyu Wei, Baolin
Peng, Kam-Fai Wong:
Using Content-level Structures for Summarizing Microblog
Repost Trees. EMNLP 2015, 2168-2178
(Mann et al. 1988) William C. Mann, Sandra A. Thompson:
Rhetorical structure theory: Toward a functional theory
of text organization. Text-Interdisciplinary Journal for
the Study of Discourse 1988, 243-281.
53
Reference
(Marcu 2000) Daniel Marcu: The Theory and
Practice of Discourse and Summarization. The MIT Press
2000.
(Mei et al. 2010) Qiaozhu Mei, Jian Guo, Dragomir R.
Radev:
DivRank: the interplay of prestige and diversity in
information networks. KDD 2010, 1009-1018
54
Reference
(Radev et al. 2002) Radev D., E. Hovy and K. McKeown 2002.
“Introduction to the Special Issue on Summarization”,
Computational Linguistics. 28(4):399-408.
(Stolcke et al. 2000) A. Stolcke, K. Ries, N. Coccaro, E. Shriberg,
R. A. Bates, D. Jurafsky, P. Taylor, R. Martin, C. Van Ess-Dykema,
M. Meteer:
Dialogue Act Modeling for Automatic Tagging and Recognition of
Conversational Speech. Computational linguistics, 26(3), 339-373.
(Wolf et al. 2004) Florian Wolf, Edward Gibson:
Paragraph-, Word-, and Coherence-based Approaches to Sentence
Ranking: A Comparison of Algorithm and Human Performance. ACL
2004, 383-390
55