NLP for Microblog Summarization
KAM-FAI WONG
THE CHINESE UNIVERSITY OF HONG KONG

Outline
Part-I Introduction (Hypotheses)
Part-II Microblog Summarization
Part-III Coarse Grain M-S
Part-IV Open Questions

Part-I Introduction (Hypotheses)

World Facts
World Population = 7.21 Billion
World Internet penetration = 42% (3B)
World SM penetration = 29% (2.1B)
Mobile Subscription = ~100% (7.09B)
http://wearesocial.net/blog/2015/01/digital-social-mobileworldwide-2015

China Facts
Population = 1.37 Billion
Internet penetration = 47% (0.642B)
Active SM Account = 46% (0.629B)
Mobile subscription = 95% (1.3B)
Top SM Activities = WeChat (30%), Sina WeiBo (25%), Tencent WeiBo (21%), Youku (19%), Google+ (10%), Facebook (9%)

Microblogging
• Microblog platforms: WeChat, Twitter, etc.
• Usage: sharing (e.g. check-ins, 打卡), event reporting, discussion, information dissemination (e.g. real-life issues such as the missing MH370, iPhone 6s ads, etc.)
• Microblog processing is useful for event analysis, e.g. for e-commerce, opinion mining, etc.

Original Microblog
李晨:我們 (LI Chen: We)

Reposts
范冰冰:我們 (FAN Bingbing: We)
李晨:我們 (LI Chen: We)
馮紹峰:恭喜晨和冰冰 (FENG Shaofeng: Congrats to Chen and Bingbing)
用户5***6:幸福,在一起 (User5***6: Sweet love)

Microblog Repost Tree
Repost Tree = Structure + Messages
Structure = information diffusion pattern, microblogger relationship, context, etc.
Messages = short text (limited number of words and lack of context)
Semantically, a repost tree organizes fragmented text into a cohesive body

Hypothesis 1
Microblog text is a form of Natural Language document.

[O] MAS: Malaysia Airlines has lost contact of MH17 from Amsterdam. The last known position was over Ukrainian airspace. More details to follow.
[R1] Hanna: OMG…Poor on #MH17…Preying…
[R6] Taylor: Najib Razak reported that an MH plain has crashed… I suggest MAS launch an investigation immediately to identify the crashed plain.
[R2] Victoria: OMG that’s horrible!!! I'm sorry to hear that. God will all bless u poor guys.
Wish world can be peaceful. And no one will get hurt.
[R3] Dr.Dr: Six top HIV scientists are on MH17. They go for AIDS and would NEVER come back!!!
[R4] TomyBlack: 6 experts died?! Terrible loss to HIV research :(
[R5] JustinBieber: now i can’t listen to #prey without crying
[R7] MrsBig: RT
[R8] MrBig: Agree. We should confirm whether #MH17 has crashed.
[R9] WindWolf: u know, the crash may due to the war in Ukraine.
[R10] X-Man: MH370 has not been found, and now MH17’s lost, here’s something suspicious. How u guys think about this?

[Figure: the same repost tree, marked up with three levels: Sentence (m-sen), Paragraph (m-par), Document (m-doc)]

Microblog as a Document
M-Document
◦ Microblog repost tree
M-Paragraph
◦ Message cluster focusing on the same topic
M-Sentence
◦ A message on a repost tree

Hypothesis 2
Natural Language Processing (NLP) techniques are applicable to microblogs.

Part II Microblog Summarization

Summarization
The goal of text summarization is to automatically produce a succinct summary for one or more documents that preserves important information (Radev et al. 2002).
Summaries can be abstractive or extractive.

NLP for Summarization
Discourse processing
◦ Document as a sequence of connected sentences
Traditional coherence relations (Mann et al. 1988, Stolcke et al. 2000)
◦ Semantic: contrast, elaboration, cause, purpose, etc.
◦ Pragmatic: speech acts (question, statement, respond, etc.)
Can conventional NLP techniques be used?

李晨:我們 (LI Chen: We)
1,,055,55 4

Objective
Microblog summarization
◦ To identify salient messages and generate a succinct summary that conserves important information

Difficulties
(Chang et al. 2013) showed that conventional extractive summarization models (e.g. LexRank, MEAD, tf-idf, Integer Linear Programming) are ineffective on microblogs, because microblog text is poor in quality:
◦ Short and noisy messages
◦ Lack of grammatical structure and context

Existing Work
Clustering:
◦ Event-based (Chakrabarti et al. 2011; Duan et al. 2012; Shen et al. 2013)
◦ Topic-based (Long et al. 2011; Rosa et al. 2011; Meng et al. 2012)

Existing Work
Solution: Make use of social signals, e.g. user influence and message popularity.
Problem: these signals do not necessarily indicate salient messages, e.g. celebrities can post a popular message with no important content.

Existing Work
(Chang et al. 2013) investigated Twitter summarization:
◦ Input: tweet stream (not tree)
◦ Salient message extraction: user influence based on user interaction (not content-based)
◦ Method: supervised (needs manual labeling)

Hypothesis 3
Some microbloggers (i.e. leaders) are more influential than others (i.e. followers).
Coarse grain microblog summarization based on leaders and followers

Our Approach
Coarse grain microblog summarization
◦ Input: microblog repost tree
◦ Salient message extraction: (1) content similarity of repost messages (sentence level) + (2) context coherence based on repost tree structure (discourse level)
◦ Method: unsupervised

Part III Coarse Grain Microblog Summarization

Preamble
Jing Li, Wei Gao, Zhongyu Wei, Baolin Peng and Kam-Fai Wong, “Using Content-level Structures for Summarizing Microblog Repost Trees”, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pp. 2168-2178.

Microblog Repost Tree
T = (V, E)
Nodes (V):
◦ All reposts to an original microblog post
Root (vo):
◦ The original microblog post
Edges (E):
◦ Reposting relations

[O] MAS: Malaysia Airlines has lost contact of MH17 from Amsterdam. The last known position was over Ukrainian airspace. More details to follow.
[R1] Hanna: OMG…Poor on #MH17…Preying…
[R6] Taylor: Najib Razak reported that an MH plain has crashed… I suggest MAS launch an investigation immediately to identify the crashed plain.
[R2] Victoria: OMG that’s horrible!!! I'm sorry to hear that. God will all bless u poor guys. Wish world can be peaceful. And no one will get hurt.
[R3] Dr.Dr: Six top HIV scientists are on MH17. They go for AIDS and would NEVER come back!!!
[R4] TomyBlack: 6 experts died?! Terrible loss to HIV research :(
[R5] JustinBieber: now i can’t listen to #prey without crying
[R7] MrsBig: RT
[R8] MrBig: Agree. We should confirm whether #MH17 has crashed.
[R9] WindWolf: u know, the crash may due to the war in Ukraine.
[R10] X-Man: MH370 has not been found, and now MH17’s lost, here’s something suspicious.
How u guys think about this?

[Figure: the repost tree annotated with discourse labels: Statement, Suggestion, Respond, Background, Repeat, Support, Statement & Question]
Statement: MAS has lost contact of MH17.
Suggestion: A crashed plain found. I suggest MAS launch an immediate investigation.
Repeat: RT
Support: Agree…
Background: The crash may due to Ukrainian war.
Statement & Question: MH370 has not been found and MH17 is lost. There’s something suspicious. How u guys think about it?

Discourse in Microblog
Traditional coherence relations (Mann et al. 1988, Stolcke et al. 2000)
◦ Semantic: contrast, elaboration, cause, purpose, etc.
◦ Pragmatic: speech acts (question, statement, respond, etc.)
Model coherence relations on repost tree
◦ Coarse-grained: leaders & followers

[Figure: leaders and followers on the repost tree, with relation labels Root, Respond, New info]
Leader: MAS has lost contact of MH17.
Follower: OMG…Poor on #MH17…
Follower: OMG horrible!!! Wish world can be peaceful.
Leader: Six top HIV scientists are on MH17.
Follower: Experts died?! Terrible loss.

Coarse Grain Summarizer
Two Steps: (1) Leader Detection (CRF) + (2) Summarization (LeadSum)

Step 1: Leader detection model
[Figure: sequence labelling along a repost path; node labels O, F, F, L, F]

Features for leader detection
Feature Category      Feature Description
Text-based            Type of sentence of mi (question or exclamatory)
Microblog-specific
Path-specific         Cosine similarity between mi and its neighbors
                      Cosine similarity between mi and root microblog

Step 2: Summarization
Basic-LeadSum model
[Figure: random walk over the leaders only; edges weighted by content similarity (sim) between messages on the repost tree]
Transition probabilities based on DivRank (Mei et al. 2010)

Potential Problems of the Basic-LeadSum model
Error propagation from the leader detection model
◦ Leaders misclassified as followers (False Negative): strong summary candidates are left out
◦ Followers misidentified as leaders (False Positive): real followers may be extracted into the summary
To reduce errors cascaded from the leader detection module
◦ Enhance Basic-LeadSum to Soft-LeadSum

[Figure: Soft-LeadSum random walk on the repost tree]
All messages participate in the ranking process.
WALK-1: similarity-based (sim) walk between messages
WALK-2: sample “Leader?” (Yes / No) from the leader probability of the current vertex; go to parent

Soft-LeadSum model
[Equation: piecewise definition with the cases “if u = v”, “if v is u’s ancestor”, and “otherwise”]

Experiment set up for leader detection
Data: 1300 reposting paths
◦ 1300 original microblogs + 4772 reposts
◦ 1000 paths for training and 300 for test
3 annotators label leaders/followers given repost tree paths
◦ use labels agreed by at least 2 annotators

Performance of leader detection models
            Cross-validation         Held-out
            Prec   Recall  F1        Prec   Recall  F1
Random      .298   .495    .373      .316   .496    .386
LR          .705   .663    .684      .704   .662    .682
SVM         .709   .669    .688      .689   .662    .675
SVMhmm      .748   .655    .698      .693   .701    .697
CRF         .755   .720    .737      .711   .707    .709

Data collection for summarization
Name          # of nodes   # of nodes with comments   Height   Category
Tree (I)      21,353       15,409                     16       Social news
Tree (II)     9,616        6,073                      11       Social news
Tree (III)    13,087       9,583                      8        Movie
Tree (IV)     12,865       7,083                      8        Music
Tree (V)      10,666       7,129                      8        Entertainment news
Tree (VI)     21,127       15,057                     11       Sports news
Tree (VII)    18,974       12,399                     13       Social news
Tree (VIII)   2,021        925                        18       Political news
Tree (IX)     9,230        5,408                      14       Breaking events
Tree (X)      10,052       4,257                      25       Breaking events

Performance of summarization models
               ROUGE-1                ROUGE-2
               F1     σ      SIG      F1     σ      SIG
RandSum        .159   .046   **‡      .037   .009   **‡
RepSum         .162   .071   **‡      .030   .016   **‡
UserRankSum    .292   .066   ‡        .087   .028   †
LeadProSum     .270   .119   ‡        .064   .038   ‡
SVDSum         .222   .070   **‡      .048   .032   **‡
DivRankSum     .159   .079   **‡      .029   .018   **‡
UserInfSum     .272   .091   ‡        .071   .028   ‡
B-LS+SVMhmm    .301   .031   ‡        .085   .020   †
B-LS+CRF       .300   .029   ‡        .082   .016   ‡
S-LS+CRF       .351   .027   NA       .105   .018   NA

Conclusions
Contribution:
Propose a novel framework to summarize repost trees utilizing coarse-grained discourse on microblog repost trees. (Corollary: NLP techniques can be used in microblog summarization.)
Achievements:
Introduce the leader/follower concept to reduce noise on repost trees
Propose a CRF-based leader detection model utilizing microblogging content and context information.
Incorporate the leader detection result into an effective summarization model based on random walk

Part IV Open Questions

Several open questions
Are there other effective features on microblog repost trees that help microblog summarization?
◦ Locations in repost tree?
◦ Posting time?
Can discourse in repost trees help other NLP applications?
◦ Sentiment analysis?
◦ Reasoning and comprehension?

Reference
(Lafferty et al. 2001) John D. Lafferty, Andrew McCallum, Fernando C. N. Pereira: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ICML 2001, 282-289.
(Li et al. 2015) Jing Li, Wei Gao, Zhongyu Wei, Baolin Peng, Kam-Fai Wong: Using Content-level Structures for Summarizing Microblog Repost Trees. EMNLP 2015, 2168-2178.
(Mann et al. 1988) William C. Mann, Sandra A. Thompson: Rhetorical Structure Theory: Toward a Functional Theory of Text Organization. Text - Interdisciplinary Journal for the Study of Discourse 1988, 243-281.
(Marcu et al. 2000) Daniel Marcu: The Theory and Practice of Discourse Parsing and Summarization. The MIT Press 2000.
(Mei et al. 2010) Qiaozhu Mei, Jian Guo, Dragomir R. Radev: DivRank: The Interplay of Prestige and Diversity in Information Networks. KDD 2010, 1009-1018.
(Radev et al. 2002) Dragomir R. Radev, Eduard Hovy, Kathleen McKeown: Introduction to the Special Issue on Summarization. Computational Linguistics 28(4), 2002, 399-408.
(Stolcke et al. 2000) A. Stolcke, K. Ries, N. Coccaro, E. Shriberg, R. A. Bates, D. Jurafsky, P. Taylor, R. Martin, C. Van Ess-Dykema, M. Meteer: Dialogue Act Modeling for Automatic Tagging and Recognition of Conversational Speech. Computational Linguistics 26(3), 2000, 339-373.
(Wolf et al. 2004) Florian Wolf, Edward Gibson: Paragraph-, Word-, and Coherence-based Approaches to Sentence Ranking: A Comparison of Algorithm and Human Performance. ACL 2004, 383-390.
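
Sketch: ranking leader messages by random walk
The Basic-LeadSum step in Part III ranks leader messages by a random walk whose edges are weighted by content similarity. The Python sketch below is only an illustration of that idea under stated assumptions, not the method from the talk. It swaps in a plain PageRank-style walk (the talk uses DivRank-based transition probabilities, whose formula is not reproduced in the slides), it uses word-count cosine similarity, and it takes the leader labels as given by hand instead of from the CRF detector. The names bag_of_words, cosine and rank_leaders are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-cased word counts; a crude stand-in for real tokenisation."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count Counters."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_leaders(messages, leaders, damping=0.85, iterations=50):
    """PageRank-style scores for leader messages (assumed stand-in for DivRank)."""
    ids = [m for m in messages if m in leaders]
    n = len(ids)
    if n < 2:
        return {m: 1.0 for m in ids}
    bows = {m: bag_of_words(messages[m]) for m in ids}
    # Row-normalise similarities into transition probabilities;
    # a message with no similar neighbour jumps uniformly.
    trans = {}
    for u in ids:
        sims = {v: cosine(bows[u], bows[v]) for v in ids if v != u}
        total = sum(sims.values())
        trans[u] = {v: (s / total if total else 1.0 / (n - 1)) for v, s in sims.items()}
    score = {m: 1.0 / n for m in ids}
    for _ in range(iterations):
        score = {
            v: (1 - damping) / n
            + damping * sum(score[u] * trans[u].get(v, 0.0) for u in ids)
            for v in ids
        }
    return score


# Messages come from the MH17 example; the leader set is assumed here,
# whereas the talk obtains it from the CRF leader detector.
messages = {
    "O": "Malaysia Airlines has lost contact of MH17 from Amsterdam.",
    "R1": "OMG…Poor on #MH17…Preying…",
    "R3": "Six top HIV scientists are on MH17.",
    "R6": "I suggest MAS launch an investigation immediately to identify the crashed plain.",
}
leaders = {"O", "R3", "R6"}
scores = rank_leaders(messages, leaders)
summary = sorted(scores, key=scores.get, reverse=True)  # highest-scoring leaders first
print(summary)
```

Followers such as R1 never enter the walk, which mirrors the "only leaders" restriction of Basic-LeadSum. Soft-LeadSum would instead let every message participate and weight it by its leader probability.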