over time

On Frequent
Chatters Mining
Claudio Lucchese
1st HPC Lab Workshop
6/15/12
1st HPC Workshop - Claudio Lucchese
Frequent Patterns Mining
• How many patterns do you see in the following dataset?
[Figure: binary dataset with axes labeled A-M and 1-13]
Claudio Lucchese, Salvatore Orlando, Raffaele Perego: Mining Top-K Patterns from Binary Datasets in Presence of Noise.
SDM 2010
Frequent Patterns Mining
[Figure: the dataset with axes labeled A-M and 1-13]
Frequent Patterns Mining
usually rows and cols are not in “good-looking” order
State of the art
• Most recent approaches try to discover the top-k patterns that
optimize different cost functions:
• Minimize Noise (“holes”) or
• Minimize MDL
• encoding(Patterns) + encoding(Data|Patterns)
• Maximize Information Ratio:
• Number of bits of information w.r.t. the Maximum Entropy Model
built on the basis of the row and column marginal distributions
• Minimize length of patterns and the amount of noise
(our approach =)
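
The last cost function above (pattern length plus noise) can be made concrete with a small sketch. It assumes that a pattern is a pair (set of rows, set of columns), that its length is the number of rows plus the number of columns, and that noise counts the cells where the data differs from the union of the patterns. These definitions and all function names are illustrative assumptions, not taken from the slides.

```python
from typing import List, Set, Tuple

# Assumed representation: a pattern is (row indices, column indices).
Pattern = Tuple[Set[int], Set[int]]


def pattern_cost(patterns: List[Pattern]) -> int:
    """Length of the pattern set: rows plus columns of each pattern."""
    return sum(len(rows) + len(cols) for rows, cols in patterns)


def noise_cost(data: List[List[int]], patterns: List[Pattern]) -> int:
    """Number of cells where the data differs from the cells covered by patterns."""
    n_rows = len(data)
    n_cols = len(data[0]) if data else 0
    covered = [[0] * n_cols for _ in range(n_rows)]
    for rows, cols in patterns:
        for r in rows:
            for c in cols:
                covered[r][c] = 1
    return sum(
        data[r][c] != covered[r][c]
        for r in range(n_rows)
        for c in range(n_cols)
    )


def total_cost(data: List[List[int]], patterns: List[Pattern]) -> int:
    """Cost to minimize: pattern length plus amount of noise."""
    return pattern_cost(patterns) + noise_cost(data, patterns)


# Toy dataset: one dense block with a single hole, plus one isolated 1.
data = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
]
patterns = [({0, 1, 2, 3}, {0, 1, 2})]

print(total_cost(data, patterns))  # one pattern tolerating one hole
print(total_cost(data, []))        # no patterns: every 1 counts as noise
```

In this toy case a single pattern that tolerates one hole is cheaper than describing every 1 as noise, which is the trade-off the cost function is meant to capture.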
Evaluation
• Unsupervised:
• Measure how well the proposed algorithm optimizes the proposed
cost function
• What is the best cost function ?
• We are investigating supervised measures:
• Unsupervised extraction: extract patterns from a
classification/clustering dataset without using class/cluster
label information
• Supervised evaluation: measure how well the patterns can
predict/match classes/clusters
• Preliminary result:
• Fancy cost functions might not be the best ones
Information Overload in News
Gianmarco De Francisci Morales, Aristides Gionis, Claudio Lucchese: From chatter to headlines:
harnessing the real-time web for personalized news recommendation. WSDM 2012.
Can we exploit Twitter?
✓ Timeliness
✓ Personalization
[Figure: number of mentions of “Osama Bin Laden”]
News Get Old Soon
• 90% of the clicks happen within 2 days from publication
• Only a few occur early!
T.Rex (Twitter-based news
recommendation system)
• Builds a user model from Twitter
• Signals from user generated content, social neighbors and popularity
across Twitter and news
• Entity-based representation (overcomes vocabulary mismatch)
• Learn a personalized news ranking function:
• Pick up candidates from a pool of related or popular fresh news,
rank them and present top-k to the user
Recommendation Model
• Ranking function is user and time dependent
• Social model + Content model + Popularity model
• Popularity model tracks entity popularity by the number of
mentions in Twitter and news (with exponential forgetting)
• Content model measures relatedness of a bag-of-entities
representation of a user's tweet stream and of a news article
• Social model weights the content model of every social
neighbor by a truncated PageRank on the Twitter network
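
The popularity and content models described above can be sketched as follows. The half-life parameterization of the exponential forgetting, the use of cosine similarity as the relatedness measure, and all class and function names are assumptions made for illustration; the slides only state that mentions are counted with exponential forgetting and that relatedness is computed over bags of entities.

```python
import math
from collections import Counter
from typing import Dict


class DecayedPopularity:
    """Per-entity mention counter with exponential forgetting (assumed form)."""

    def __init__(self, half_life: float) -> None:
        # After `half_life` time units an old mention counts half as much.
        self.decay = math.log(2) / half_life
        self.score: Dict[str, float] = {}
        self.last_seen: Dict[str, float] = {}

    def value(self, entity: str, now: float) -> float:
        """Decayed mention count of `entity` at time `now`."""
        if entity not in self.last_seen:
            return 0.0
        elapsed = now - self.last_seen[entity]
        return self.score[entity] * math.exp(-self.decay * elapsed)

    def mention(self, entity: str, now: float) -> None:
        """Record one mention: decay the old count, then add 1."""
        self.score[entity] = self.value(entity, now) + 1.0
        self.last_seen[entity] = now


def relatedness(user_entities: Counter, news_entities: Counter) -> float:
    """Cosine similarity between two bags of entities (an assumed measure)."""
    dot = sum(user_entities[e] * news_entities[e] for e in user_entities)
    norm_u = math.sqrt(sum(v * v for v in user_entities.values()))
    norm_n = math.sqrt(sum(v * v for v in news_entities.values()))
    if norm_u == 0.0 or norm_n == 0.0:
        return 0.0
    return dot / (norm_u * norm_n)


# Usage: two mentions 10 time units apart, with a half-life of 10.
popularity = DecayedPopularity(half_life=10.0)
popularity.mention("Monti", now=0.0)
popularity.mention("Monti", now=10.0)
print(popularity.value("Monti", now=10.0))

user = Counter({"Monti": 1, "Merkel": 1})
article = Counter({"Monti": 1})
print(relatedness(user, article))
```

Both components only require counting and constant-time updates per mention, which fits a streaming setting.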
System Overview
✓ Designed to be streaming and lightweight (just counting)
✓ User model is updated continuously
Learning the Weights
• Learning to rank approach with SVM
• Each time the user clicks on a news item, we learn a set of
preferences (clicked_news > non_clicked_news)
• Prune the number of constraints for scalability:
• only news published in the last 2 days
• only take the top-k news for each ranking component
• Can optionally include additional features for news articles:
• click count, age, etc... (T.Rex+)
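
The constraint generation and pruning steps listed above can be sketched as below. The data layout, the function name `preference_pairs`, and the use of feature-difference vectors (the usual pairwise transform for a ranking SVM) are assumptions for illustration, not details confirmed by the slides.

```python
from typing import Dict, List, Tuple

# Assumed layout: news_id -> (publication time in days, feature vector),
# with features ordered as [social, content, popularity].
Candidates = Dict[str, Tuple[float, List[float]]]


def preference_pairs(
    clicked_id: str,
    click_time: float,
    candidates: Candidates,
    max_age_days: float = 2.0,
    top_k: int = 1,
) -> List[Tuple[str, List[float]]]:
    """Build pruned (non_clicked_id, clicked - non_clicked) preference pairs."""
    # Pruning 1: keep only news published in the last `max_age_days` days.
    fresh = {
        nid: feats
        for nid, (published, feats) in candidates.items()
        if 0.0 <= click_time - published <= max_age_days
    }
    if clicked_id not in fresh:
        return []

    # Pruning 2: keep only the top-k non-clicked news per ranking component.
    others = [nid for nid in fresh if nid != clicked_id]
    keep = set()
    for j in range(len(fresh[clicked_id])):
        ranked = sorted(others, key=lambda nid: fresh[nid][j], reverse=True)
        keep.update(ranked[:top_k])

    # Each pair encodes the preference clicked_news > non_clicked_news.
    clicked = fresh[clicked_id]
    return [
        (nid, [a - b for a, b in zip(clicked, fresh[nid])])
        for nid in sorted(keep)
    ]


candidates: Candidates = {
    "a": (9.5, [0.90, 0.1, 0.20]),  # the clicked news
    "b": (9.0, [0.10, 0.8, 0.10]),
    "c": (8.5, [0.20, 0.2, 0.90]),
    "d": (9.9, [0.05, 0.1, 0.05]),
    "e": (5.0, [1.00, 1.0, 1.00]),  # older than 2 days, pruned
}
pairs = preference_pairs("a", click_time=10.0, candidates=candidates)
for news_id, diff in pairs:
    print(news_id, diff)
```

The difference vectors, each labeled as a positive example, could then be passed to any linear SVM trainer to learn the component weights.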
Predicting Clicked News
✓ User generated content is a very good predictor albeit very sparse
✓ Click Count is a strong baseline but does not help T.Rex+
Predicting Clicked Entities
Future works (?)
• Explain a set of news showing how the main topics
interacted with each other over time.
• Example: European sovereign-debt crisis
[Figure: timeline (axis: time) with labels EuroBond, New Italian government, Loan, Fiscal Compact, Obama, Berlusconi, Monti, Merkel, EU, France, Greece]
• Applications:
• Given the news the user is currently reading, provide an
explanation of the related facts that precede that news
• Given a query, provide an explanation of the documents
related to that query
• Given a set of topics, explain their relations over time
• Browse a collection of news, by changing the topics of
interest, the time window, the granularity
• A topic is a named entity relevant over time
• An interaction is a cluster of news related to some event
and relevant in a small time window
• It might be important to cover the given time window,
but recent events might be more interesting
• Given a maximum number of main topics and interactions,
maximize:
• Topic coverage and diversity
• Events time coverage
• Cluster similarity
• Main topics connectivity
• It is different from news clustering:
• Even if you had a good clustering, it might not be trivial to
select which events and which topics to show in order to
maximize the amount of information delivered to the user
• There is some interesting related work aimed at finding
chains of news; we are more interested in topic evolution
Thank you!