On Frequent Chatters Mining
Claudio Lucchese
1st HPC Lab Workshop
6/15/12 1st HPC Workshop - Claudio Lucchese

Frequent Patterns Mining
• How many patterns do you see in the following dataset?
[Figure: binary dataset with columns A to M and rows 1 to 13]
Claudio Lucchese, Salvatore Orlando, Raffaele Perego: Mining Top-K Patterns from Binary Datasets in Presence of Noise. SDM 2010

Frequent Patterns Mining
[Figure: the same dataset, columns A to M and rows 1 to 13]

Frequent Patterns Mining
• Usually rows and columns are not in “good-looking” order

State of the art
• Most recent approaches try to discover the top-k patterns that optimize different cost functions:
  • Minimize noise (“holes”), or
  • Minimize MDL:
    • encoding(Patterns) + encoding(Data|Patterns)
  • Maximize Information Ratio:
    • Number of bits of information w.r.t. the Maximum Entropy Model built on the row and column marginal distributions
  • Minimize the length of the patterns and the amount of noise (our approach)

Evaluation
• Unsupervised:
  • Measure how well the proposed algorithm optimizes the proposed cost function
  • What is the best cost function?
• We are investigating supervised measures:
  • Unsupervised extraction: extract patterns from a classification/clustering dataset without class/cluster label information
  • Supervised evaluation: measure how well the patterns can predict/match classes/clusters
• Preliminary result:
  • Fancy cost functions might not be the best ones

Information Overload in News
Gianmarco De Francisci Morales, Aristides Gionis, Claudio Lucchese: From chatter to headlines: harnessing the real-time web for personalized news recommendation. WSDM 2012.

Can we exploit Twitter?
✓ Timeliness
✓ Personalization
[Figure: number of mentions of “Osama Bin Laden”]

News Get Old Soon
• 90% of the clicks happen within 2 days from publication
• Only a few occur early!

T.Rex (Twitter-based news recommendation system)
• Builds a user model from Twitter
  • Signals from user-generated content, social neighbors, and popularity across Twitter and news
  • Entity-based representation (overcomes vocabulary mismatch)
• Learns a personalized news ranking function:
  • Picks candidates from a pool of related or popular fresh news, ranks them, and presents the top-k to the user

Recommendation Model
• Ranking function is user and time dependent
• Social model + Content model + Popularity model
• Popularity model tracks entity popularity by the number of mentions in Twitter and news (with exponential forgetting)
• Content model measures the relatedness of a bag-of-entities representation of a user's tweet stream and of a news article
• Social model weights the content model of every social neighbor by a truncated PageRank on the Twitter network

System Overview
✓ Designed to be streaming and lightweight (just counting)
✓ User model is updated continuously

Learning the Weights
• Learning-to-rank approach with SVM
• Each time the user clicks on a news article, we learn a set of preferences (clicked_news > non_clicked_news):
• Prune the number of constraints for scalability:
  • only news published in the last 2 days
  • only take the top-k news for each ranking component
• Can optionally include additional features for news articles:
  • click count, age, etc...
    (T.Rex+)

Predicting Clicked News
✓ User-generated content is a very good predictor, albeit very sparse
✓ Click Count is a strong baseline but does not help T.Rex+

Predicting Clicked Entities

Future works (?)
• Explain a set of news showing how the main topics interacted with each other over time.
• Example: European sovereign-debt crisis
[Figure: topic interactions over time; labels: EuroBond, New Italian government, Loan, Fiscal Compact, Obama, Berlusconi, Monti, Merkel, EU, France, Greece; axis: time]

Future works (?)
• Explain a set of news showing how the main topics interacted with each other over time.
• Applications:
  • Given the news the user is currently reading, provide an explanation of the related facts that precede that news
  • Given a query, provide an explanation of the documents related to that query
  • Given a set of topics, explain their relations over time
  • Browse a collection of news by changing the topics of interest, the time window, the granularity

Future works (?)
• Explain a set of news showing how the main topics interacted with each other over time.
• A topic is a named entity relevant over time
• An interaction is a cluster of news related to some event and relevant in a small time window
• It might be important to cover the given time window, but recent events might be more interesting

Future works (?)
• Explain a set of news showing how the main topics interacted with each other over time.
• Given a maximum number of main topics and interactions, maximize:
  • Topic coverage and diversity
  • Event time coverage
  • Cluster similarity
  • Main topics connectivity

Future works (?)
• Explain a set of news showing how the main topics interacted with each other over time.
• It is different from news clustering:
  • Even if you had a good clustering, it might not be trivial to select which events and which topics to show in order to maximize the amount of information delivered to the user
• There is some interesting related work
  • aimed at finding chains of news; we are more interested in topic evolution

Thank you!
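
Sketch: popularity with exponential forgetting
The popularity model on the Recommendation Model slide tracks entity popularity by mention counts with exponential forgetting. A minimal sketch of that idea follows. The class name `DecayingCounter`, the `half_life` parameter, and the lazy decay-on-update scheme are illustrative assumptions, not details taken from T.Rex.

```python
import math
from collections import defaultdict


class DecayingCounter:
    """Mention counter with exponential forgetting (illustrative sketch)."""

    def __init__(self, half_life: float):
        # Hypothetical parameter: time after which an old mention counts half.
        self.decay_rate = math.log(2) / half_life
        self.score = defaultdict(float)      # entity -> decayed mention count
        self.last_seen = defaultdict(float)  # entity -> time of last update

    def _decay(self, entity: str, now: float) -> None:
        # Lazily apply the decay accumulated since the last update.
        # Assumes timestamps arrive in non-decreasing order per entity.
        elapsed = now - self.last_seen[entity]
        self.score[entity] *= math.exp(-self.decay_rate * elapsed)
        self.last_seen[entity] = now

    def mention(self, entity: str, now: float, count: float = 1.0) -> None:
        # Decay the old score first, then add the new mentions.
        self._decay(entity, now)
        self.score[entity] += count

    def popularity(self, entity: str, now: float) -> float:
        # Decayed mention count of the entity at time `now`.
        self._decay(entity, now)
        return self.score[entity]
```

Each update touches a single entity and only needs its last timestamp, which fits the "streaming and lightweight (just counting)" design stated in the System Overview slide.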
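
Sketch: preference pairs from a click
The Learning the Weights slide describes turning each click into preferences (clicked_news > non_clicked_news) and pruning them to news from the last 2 days and to the top-k of each ranking component. One possible reading is sketched below. The `Article` type, the function name `click_preferences`, and the `component_scores` layout are hypothetical; the slides do not say how the two pruning rules are combined or how the pairs are passed to the SVM.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

TWO_DAYS = 2 * 24 * 3600  # seconds


@dataclass(frozen=True)
class Article:
    id: str
    published: float  # publication time, seconds since epoch


def click_preferences(
    clicked: Article,
    candidates: List[Article],
    component_scores: Dict[str, Dict[str, float]],  # component -> article id -> score
    now: float,
    k: int,
) -> List[Tuple[str, str]]:
    """Return (preferred_id, other_id) pairs generated by one click."""
    # Pruning rule 1: keep only news published in the last 2 days.
    fresh = [a for a in candidates if now - a.published <= TWO_DAYS]

    # Pruning rule 2: keep only news in the top-k of at least one component
    # (assumed combination: union over components).
    keep = set()
    for scores in component_scores.values():
        ranked = sorted(fresh, key=lambda a: scores.get(a.id, 0.0), reverse=True)
        keep.update(a.id for a in ranked[:k])

    # The clicked article is preferred over every surviving non-clicked one.
    return [(clicked.id, a.id) for a in fresh if a.id in keep and a.id != clicked.id]
```

The resulting pairs are the kind of input a pairwise learning-to-rank SVM expects; the number of pairs per click is bounded by k times the number of ranking components.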