DETECTING SPAMMERS AND
CONTENT PROMOTERS IN ONLINE
VIDEO SOCIAL NETWORKS
Fabrício Benevenuto∗, Tiago Rodrigues, Virgílio
Almeida, Jussara Almeida, and Marcos Gonçalves
Computer Science Department, Federal
University of Minas Gerais Belo Horizonte, Brazil
(SIGIR ’09)
Speaker : Yi-Ling Tai
Date : 2009/09/28
OUTLINE
 Introduction
 User test collection
 Analyzing user behavior attributes
 Detecting spammers and promoters
     Evaluation metrics
     Experimental setup
     Classification
     Reducing the attribute set
 Conclusions
INTRODUCTION
 YouTube is the most popular online video social network.
 It allows users to post a video as a response to a discussion topic.
 These features open opportunities for users to introduce polluted content into the system.
 Pollution can:
     spread advertisements to generate sales
     disseminate pornography
     compromise the system’s reputation
INTRODUCTION
 Users cannot easily identify pollution before watching it, and pollution also consumes system resources, especially bandwidth.
 This paper addresses the issue of detecting video spammers and promoters.
 Spammers:
     post an unrelated video as a response to a popular video topic to increase the likelihood of the response being viewed
 Promoters:
     post a large number of responses to boost the rank of the video topic
INTRODUCTION
 Toward this end, the authors:
     crawl a large user data set from YouTube
     “manually” classify users as legitimate users, spammers, and promoters
     study attributes that distinguish the different types of polluters
     use a supervised classification algorithm to detect spammers and promoters
USER TEST COLLECTION
 A YouTube video is a responded video (or video topic) if it has at least one video response.
 A YouTube user is a responsive user if she has posted at least one video response.
 A responded user is someone who posted at least one responded video.
 Polluter refers to either a spammer or a promoter.
CRAWLING YOUTUBE
 User interactions can be represented by a video response user graph G = (X, Y):
     X is the set of all users who posted or received video responses
     (x1, x2) is a directed arc in Y if user x1 has responded to a video contributed by user x2
 To build the graph, the authors built a crawler that implements Algorithm 1.
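The crawl can be sketched as a breadth-first traversal of the video response graph. Algorithm 1 itself is not reproduced in the slides, so `get_responders` below is a hypothetical stand-in for the actual YouTube queries:

```python
from collections import deque

def crawl_response_graph(seeds, get_responders, max_users=1000):
    """BFS-style crawl: a minimal sketch of the paper's Algorithm 1.

    `get_responders(user)` is a hypothetical fetch function returning
    the users who posted video responses to `user`'s videos.
    Returns the graph as a set of directed arcs (responder, responded_user).
    """
    arcs = set()
    visited = set()
    queue = deque(seeds)
    while queue and len(visited) < max_users:
        user = queue.popleft()
        if user in visited:
            continue
        visited.add(user)
        for responder in get_responders(user):
            # arc points from the responder to the owner of the responded video
            arcs.add((responder, user))
            if responder not in visited:
                queue.append(responder)
    return arcs
```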
CRAWLING YOUTUBE
 The sampling starts from a set of 88 seeds, consisting of the owners of the top 100 most-responded videos of all time.
 The crawler follows links, gathering information on a number of different attributes.
 The crawl collected:
     264,460 users
     381,616 responded videos
     701,950 video responses
BUILDING A TEST COLLECTION
 The main goal is to study the patterns and characteristics of each class of users.
 The collection should have the following properties:
     a significant number of users of all three categories
     including, but not restricted to, large amounts of pollution
     a large number of legitimate users with different behaviors
 Random sampling may not achieve these properties.
BUILDING A TEST COLLECTION
 Three strategies for user selection:
1. Different levels of interaction
     four groups of users based on their in- and out-degrees
     100 users randomly selected from each group
2. Aiming at seeding the test collection with polluters
     browsed the responses of the top 100 most-responded videos, selecting suspect users
3. To minimize a possible bias introduced by strategy 2
     randomly selected 300 users who posted video responses to the top 100 most-responded videos
BUILDING A TEST COLLECTION
 Each selected user was then manually classified.
 Three volunteers analyzed all video responses of each user to classify her into one of the categories.
 Volunteers were instructed to favor legitimate users.
ANALYZING USER BEHAVIOR ATTRIBUTES
 Three attribute sets were considered.
 Video attributes:
     duration, number of views, number of commentaries received
     rating, number of times the video was selected as favorite
     number of honors and of external links
 Computed over three video groups for each user:
     all videos uploaded by the user
     the user’s video responses
     the responded videos the user responded to
 This sums up to 42 video attributes per user.
ANALYZING USER BEHAVIOR ATTRIBUTES
 User attributes:
     number of friends
     number of videos uploaded
     number of videos watched
     number of videos added as favorite
     numbers of video responses posted and received
     numbers of subscriptions and subscribers
     average time between video uploads
     maximum number of videos uploaded in 24 hours
ANALYZING USER BEHAVIOR ATTRIBUTES
 Social network attributes:
     clustering coefficient: cc(i) is the ratio of the number of existing edges between i’s neighbors to the maximum possible number of such edges
     betweenness
     reciprocity
     assortativity: the ratio between the node’s (in/out-)degree and the average (in/out-)degree of its neighbors
     UserRank
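The clustering coefficient cc(i) defined above can be computed directly from an adjacency structure; a minimal sketch for an undirected graph (illustration only, not the paper's code):

```python
def clustering_coefficient(adj, i):
    """Local clustering coefficient cc(i) on an undirected graph.

    `adj` maps each node to the set of its neighbors. cc(i) is the
    fraction of pairs of i's neighbors that are themselves connected.
    """
    neighbors = adj[i]
    k = len(neighbors)
    if k < 2:
        return 0.0  # fewer than two neighbors: no pairs to check
    existing = sum(1 for u in neighbors for v in neighbors
                   if u < v and v in adj[u])
    return existing / (k * (k - 1) / 2)
```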
ANALYZING USER BEHAVIOR ATTRIBUTES
 Two well-known feature selection methods were used to rank the attributes:
     information gain
     chi-squared (χ²)
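As a sketch of the first ranking criterion, information gain can be computed as the reduction in class entropy once an attribute's value is known (a from-scratch illustration, not the paper's implementation):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(class; feature) = H(class) - H(class | feature)."""
    n = len(labels)
    conditional = 0.0
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional
```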
EVALUATION METRICS
 Standard information retrieval metrics are used:
     recall
     precision
     Micro-F1
         first compute global precision and recall values over all classes
         then calculate F1
     Macro-F1
         first calculate F1 values for each class in isolation
         then average over all classes
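The two averaging schemes can be illustrated from a confusion matrix. A minimal sketch for single-label classification, where global (micro) precision and recall both reduce to accuracy:

```python
def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

def micro_macro_f1(cm):
    """Micro- and Macro-F1 from a square confusion matrix.

    cm[i][j] = number of class-i instances predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))
    # Micro: global precision == global recall == accuracy here
    acc = correct / total
    micro = f1(acc, acc)
    # Macro: per-class F1, then an unweighted average
    per_class = []
    for i in range(n):
        tp = cm[i][i]
        predicted = sum(cm[j][i] for j in range(n))
        actual = sum(cm[i])
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        per_class.append(f1(p, r))
    return micro, sum(per_class) / n
```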
EVALUATION METRICS
 Results are also reported as a confusion matrix: entry (i, j) gives how many class-i users were predicted as class j.
EXPERIMENTAL SETUP
 libSVM, an open-source SVM package:
     allows searching for the best classifier parameters using the training data
     provides a series of optimizations, including normalization of all numerical attributes
 5-fold cross-validation:
     repeated 5 times with different seeds used to shuffle the original data set
     producing 25 different results for each test
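The 5 × 5 evaluation protocol above can be sketched as plain index splitting (a from-scratch illustration, not tied to libSVM):

```python
import random

def repeated_kfold(n_samples, k=5, repeats=5, seed=0):
    """Yield (train_idx, test_idx) splits: k-fold cross-validation
    repeated with different shuffling seeds, giving k * repeats
    evaluations (5 x 5 = 25 in the slides' setup)."""
    for r in range(repeats):
        idx = list(range(n_samples))
        random.Random(seed + r).shuffle(idx)  # fresh shuffle per repeat
        folds = [idx[i::k] for i in range(k)]
        for i in range(k):
            test = folds[i]
            train = [j for f, fold in enumerate(folds) if f != i for j in fold]
            yield train, test
```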
TWO CLASSIFICATION STRATEGIES
 Flat classification
     a single classifier separates promoters (P), spammers (S), and legitimate users (L)
 Hierarchical strategy
     first separate promoters (P) from non-promoters (NP)
     then split promoters into heavy (HP) and light (LP) promoters
     and non-promoters into legitimate users (L) and spammers (S)
FLAT CLASSIFICATION
 Confusion matrix obtained:
     the numbers presented are percentages relative to the total number of users in each class
     the diagonal indicates the recall in each class
 No promoter was classified as a legitimate user.
 3.87% of promoters were misclassified: their videos actually acquired popularity, which makes them harder to distinguish from spammers.
FLAT CLASSIFICATION
 41.91% of spammers were misclassified as legitimate users: legitimate users also post video responses to popular responded videos (a typical behavior of spammers).
 Micro-F1 = 87.5, with per-class F1 values of 90.8, 63.7, and 92.3
 Macro-F1 = 82.2
HIERARCHICAL CLASSIFICATION
 Binary classification at each level.
 The J parameter: one can give priority to one class (e.g., spammers) over the other (e.g., legitimate users).
 Promoters vs. non-promoters:
     Macro-F1 = 93.44
     Micro-F1 = 99.17
NON-PROMOTERS
 The classifier was trained with the original training data without promoters, initially with J = 1.
 Varying J shifts the balance between the two classes:
     J = 0.1 → 24% vs. 1%
     J = 3.0 → 71% vs. 9%
 The best solution depends on the system administrator’s objectives.
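The role of J can be illustrated with a toy cost-sensitive decision rule: weighting missed spammers by J moves the operating point, just as the J = 0.1 vs. J = 3.0 settings above do. This is an analogue for illustration only, not libSVM's actual formulation:

```python
def best_threshold(scores, labels, J=1.0):
    """Pick the decision threshold minimizing a J-weighted error:
    missing a spammer (label 1) costs J, flagging a legitimate user
    (label 0) costs 1.  Users with score >= threshold are flagged."""
    candidates = sorted(set(scores)) + [max(scores) + 1]
    def cost(t):
        return sum(J if (y == 1 and s < t) else      # missed spammer
                   (1 if (y == 0 and s >= t) else 0)  # flagged legit user
                   for s, y in zip(scores, labels))
    return min(candidates, key=cost)
```

Raising J pushes the threshold down (catch more spammers, flag more legitimate users); lowering it does the opposite.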
HEAVY AND LIGHT PROMOTERS
 Aggressiveness: the maximum number of video responses posted in a 24-hour period.
 The k-means clustering algorithm was used to separate promoters into two clusters.
 Average aggressiveness:
     light promoters = 15.78 (CV = 0.63)
     heavy promoters = 107.54 (CV = 0.61)
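The clustering step above can be sketched as a 1-D k-means with k = 2 over the aggressiveness scores (a minimal illustration; the initialization choice and sample data are made up, not from the paper):

```python
def kmeans_1d(values, iters=20):
    """Split a list of aggressiveness scores into two clusters with a
    minimal 1-D k-means (k = 2): light promoters near the low centroid,
    heavy promoters near the high one."""
    c = [min(values), max(values)]  # initial centroids at the extremes
    groups = ([], [])
    for _ in range(iters):
        groups = ([], [])
        for v in values:
            # index 0/1 = closer to the low/high centroid
            groups[abs(v - c[0]) > abs(v - c[1])].append(v)
        c = [sum(g) / len(g) if g else c[i] for i, g in enumerate(groups)]
    return c, groups
```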
HEAVY AND LIGHT PROMOTERS
 Binary classification:
     the classifier was retrained with the original training data containing only promoters
REDUCING THE ATTRIBUTE SET
 Two scenarios:
     evaluating classification using attributes in decreasing order of position in the χ² ranking
     evaluating classification using subsets of 10 attributes occupying contiguous positions in the ranking
CONCLUSIONS
 An effective solution to detect spammers and promoters in online video social networks.
 The flat classification approach provides an alternative to simply considering all users as legitimate.
 The hierarchical approach explores different classification tradeoffs and provides more flexibility for the application.
 Significant benefits can be obtained with only a small subset of less expensive attributes.
 Spammers and promoters will evolve and adapt to anti-pollution strategies, so periodic reassessment of the classification process may be necessary.