EVENT IDENTIFICATION IN SOCIAL MEDIA

Hila Becker, Luis Gravano (Columbia University)
Mor Naaman (Rutgers University)


Social Media Sites Host Many "Event" Documents

  - "Event" = something that occurs at a certain time in a certain place [Yang et al. '99]
  - Popular, widely known events
      - Presidential Inauguration, Thanksgiving Day Parade
  - Smaller events, without traditional news coverage
      - Local food drive, street fair
  - …
  - Sites
      - Photo-sharing: Flickr
      - Video-sharing: YouTube
      - Social networking: Facebook
  - Example: social media documents for the "All Points West" festival, Liberty State Park, New Jersey, 8/8/08


Identifying Events and Associated Social Media Documents

  - Applications
      - Event search and browsing
      - Local search
      - …
  - General approach: group similar documents via clustering
      - Each cluster corresponds to one event and its associated social media documents


Event Identification: Challenges

  - Uneven data quality
      - Missing, short, uninformative text
      - … but revealing structured context available: tags, date/time, geo-coordinates
  - Scalability
  - Dynamic data stream of event information
  - Unknown number of events
      - Necessary for many clustering algorithms
      - Difficult to estimate


Clustering Social Media Documents

  - Social media document representation
  - Social media document similarity
  - Social media document clustering
      - Clustering task: definition
      - Ensemble algorithm: combining multiple clustering results
  - Preliminary evaluation


Social Media Document Representation

  - Title
  - Description
  - Tags
  - Date/Time
  - Location
  - All-Text


Social Media Document Similarity

  - Text (title, description, tags, all-text): tf-idf weights, cosine similarity
  - Time (date/time): proximity in minutes
  - Location: geo-coordinate proximity


Social Media Document Clustering Framework

  - Social media documents -> Document feature representation -> Event clusters


Clustering: Ensemble Algorithm

  - Individual clusterers and their weights: C_title / W_title, C_tags / W_tags, C_time / W_time
  - Consensus function f(C, W): combine ensemble similarities
      - Learned in a training step
  - Output: ensemble clustering solution


Clustering: Measuring Quality

  - Homogeneous clusters
  - Complete clusters
  - Metric: Normalized Mutual Information (NMI)
      - Shared information between clustering solution and "ground truth"


Experimental Setup

  - Data: >270K Flickr photos
      - Event labels from Yahoo!'s "upcoming" event database
      - Split into 3 parts for training/validation/testing
  - Clusterers: single pass algorithm with centroid similarity
  - Weighting scheme: Normalized Mutual Information (NMI) scores on validation set
  - Consensus function: weighted average of clusterers' binary predictions
  - Final prediction step: single pass clustering algorithm


Preliminary Evaluation Results

  - Individual clusterer performance
      - Highest NMI: Tags, All-Text
      - Lowest NMI: Description, Title
  - Ensemble performance, compared against all individual clusterers
      - Highest overall performance in terms of NMI
      - More homogeneous clusters: each event is spread over fewer clusters
  - Details in paper


Future Work: Alternative Choices

  - Document similarity metric
  - Ensemble approach
      - Weight assignment
      - Choice of clusterers
  - Train a classifier to predict document similarity
      - Features correspond to similarity scores
          - All-text, title, tags, time, location, etc.
          - Numeric values in [0,1]
      - State-of-the-art classifiers: SVM, Logistic Regression, …


Future Work: Alternative Choices (continued)

  - Final clustering step
      - Apply graph partitioning algorithms
      - Requires estimating the number of clusters
  - Evaluation metrics: beyond NMI
  - Datasets
      - Flickr, LastFM, YouTube
  - Exploit social network connections


Conclusions

  - Identified events and their corresponding social media documents
  - Proposed a clustering solution
      - Leveraged different representations of social media documents
      - Employed various social media similarity metrics
      - Developed a weighted ensemble clustering approach
  - Reported preliminary results of our event identification approach on a large-scale dataset of Flickr photographs
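
Appendix sketch: weighted consensus and single pass clustering

The experimental setup describes the consensus function as a weighted average of the clusterers' binary predictions, followed by a single pass clustering step. The Python sketch below is one possible reading of that description, not the authors' implementation. The function names, the toy labelings, the weights, the 0.6 threshold, and the use of average pairwise consensus as a stand-in for centroid similarity are all assumptions made for illustration.

```python
from typing import Dict, List

def consensus_score(i: int, j: int,
                    labelings: Dict[str, List[int]],
                    weights: Dict[str, float]) -> float:
    """Weighted average of binary 'same cluster' votes for documents i and j."""
    total = sum(weights.values())
    agree = sum(w for name, w in weights.items()
                if labelings[name][i] == labelings[name][j])
    return agree / total

def single_pass(n_docs: int,
                labelings: Dict[str, List[int]],
                weights: Dict[str, float],
                threshold: float = 0.6) -> List[List[int]]:
    """Process documents once, in order; join the best cluster or open a new one."""
    clusters: List[List[int]] = []
    for doc in range(n_docs):
        best_idx, best_score = None, threshold
        for idx, members in enumerate(clusters):
            # Assumption: average pairwise consensus stands in for centroid similarity.
            score = sum(consensus_score(doc, m, labelings, weights)
                        for m in members) / len(members)
            if score >= best_score:
                best_idx, best_score = idx, score
        if best_idx is None:
            clusters.append([doc])          # no cluster is similar enough
        else:
            clusters[best_idx].append(doc)
    return clusters

# Hypothetical output of three individual clusterers over five documents.
labelings = {
    "title": [0, 0, 1, 2, 3],   # over-splits: titles are short and noisy
    "tags":  [0, 0, 0, 1, 1],
    "time":  [0, 0, 0, 1, 1],
}
# Hypothetical weights, standing in for NMI scores measured on a validation set.
weights = {"title": 0.25, "tags": 0.75, "time": 0.5}

clusters = single_pass(5, labelings, weights)
print(clusters)
```

In this toy input the title clusterer splits both events, but its low weight lets the tags and time clusterers outvote it, so no number of clusters has to be fixed in advance.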
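
Appendix sketch: Normalized Mutual Information

The slides use NMI to measure the information shared between a clustering solution and the "ground truth". The sketch below shows one common way to compute it from two label lists. The slides do not say which normalization is used, so dividing by the average of the two entropies is an assumption here, as are the function names.

```python
import math
from collections import Counter
from typing import Hashable, Sequence

def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Mutual information between two label assignments of the same documents."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    return sum((nij / n) * math.log(nij * n / (count_a[x] * count_b[y]))
               for (x, y), nij in joint.items())

def nmi(pred: Sequence[Hashable], truth: Sequence[Hashable]) -> float:
    """NMI normalized by the mean of the two entropies (an assumed choice)."""
    mean_entropy = (entropy(pred) + entropy(truth)) / 2
    if mean_entropy == 0:
        # Both assignments put every document in a single cluster.
        return 1.0
    return mutual_information(pred, truth) / mean_entropy

truth = [0, 0, 1, 1]                      # two events, two documents each
same_up_to_renaming = nmi([1, 1, 0, 0], truth)
over_split = nmi([0, 1, 2, 3], truth)     # every document in its own cluster
unrelated = nmi([0, 1, 0, 1], truth)
print(same_up_to_renaming, over_split, unrelated)
```

The score does not depend on cluster names, and the over-split solution is penalized even though each of its clusters is perfectly homogeneous, which is why NMI reflects both homogeneity and completeness.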