Detecting Comment Spam through Content Analysis

Congrui Huang, Qiancheng Jiang, Yan Zhang
Key Laboratory of Machine Perception, Ministry of Education
School of Electronics Engineering and Computer Science, Peking University Beijing
[email protected], {jiangqiancheng,zhy}@cis.pku.edu.cn
Abstract. In the Web 2.0 era, individual Internet users can also act as information providers, releasing information or making comments conveniently. However, some participants may spread irresponsible remarks or post irrelevant comments for commercial interests. This so-called comment spam severely hurts information quality. This paper tries to detect comment spam automatically through content analysis, using some previously undescribed features. Experiments on a real data set show that our combined heuristics can correctly identify comment spam with high precision (90.4%) and recall (84.5%).
Keywords: Comment Spam, Content Analysis, Comment Features.
1 Introduction
From its original intention as a platform for sharing data, the web has grown to be a central part of cultural, educational and commercial life. On the internet, people can purchase goods, keep an eye on the people around them and exchange ideas at any time. The principle of human centeredness helped bring about the idea of the social network service, which was first raised by Starr [1]. A social network service provides a platform for people to share their interests, or to explore the topics and activities they care about. The use of social networks connects virtual and real life: improving communication, presenting business opportunities and advancing technology. Improving the quality of social networking is therefore well worth considering. Blogs, as one kind of social networking, have been widely used in recent years. In this paper, we propose a method to improve the quality of blogs.
Blog sites rely on user-generated content, which makes them both incredibly dynamic and tempting targets for spam [2]. Spam, a cause of low-quality social networking, not only degrades the user experience but also misleads search engines. Comments, a main part of a blog, are a good indication of the significance of the weblog [3], and most bloggers also identify comment feedback as an important motivation for their writing [4, 5]. In this scenario, comment spam hurts both bloggers and users. Blog publishers provide comments so that thoughts can be easily shared, but once readers are allowed to add comments, spammers are also given a shortcut to abuse the comment system. We cannot solve this problem, which arises from being open, by simply turning off the comment system. Although filtering comment spam manually may be effective, it is exhausting and unrealistic. Practitioners and researchers have tried to find an efficient and automatic way to combat comment spam, and in the last few years research in this area has led to good results. In this paper, we propose a heuristic method that builds on this previous work.
Trevino [5] and Mishne et al. [6] mainly focused on the link spam that is common in blog comments. Mishne et al. followed a language modelling approach for detecting link spam in blogs and similar pages. KL-divergence, the main method in their solution, is only one part of our work. What's more, link spam is only one of the manifestations of comment spam. Sculley [7] summarized that academic researchers advocate the use of SVMs, while practitioners prefer Bayesian methods for content-based filtering. Sculley also presented his Relaxed Online SVM, which reduces computational cost.
Creating an efficient way to detect comment spam is a challenging problem. We must ensure that we do not mistakenly consider legitimate comments to be spam. In this paper, we propose a fresh approach to filtering comment spam. We extract several features of comments and, by analyzing the combined features, derive rules for filtering blog comments. In detail, we propose several features and, according to the distribution of each feature, apply heuristic methods to obtain judgment rules for the blog comments of each web site, with which we check comments step by step. Finally, we obtain the comment spam set and the normal comment set.
Compared with traditional methods, we do not apply classification algorithms directly, because what we face is a real and complicated network environment. With the features we propose, we build our own heuristic model, which differs from the classical ones. When we apply our model to popular blog sites, it removes comment spam effectively, which indicates that our method is feasible.
The remainder of this paper is structured as follows. In Section 2, we discuss related work. In Section 3, we define comment spam. In Sections 4 and 5, we analyze comment features and describe our heuristic methods. In Section 6, we present our experimental framework and the real-world data set that we used.
2 Related Work
Academic research directed towards comment spam is relatively rare; however, some existing solutions are worth a look. Spammers reply to their own comments, while normal users do not [8], so comment spam can be reduced by disallowing multiple consecutive submissions. Google advocates the use of rel="nofollow" [9] in order to reduce the effect of heavy inter-blog linking on page ranking. Special software products such as the Math Comment Spam Protection Plugin for WordPress [10] and Movable Type [11] have their own ways to prevent comment spam.
A survey [2] showed that the three main anti-spam strategies commonly used in practice are identification-based, rank-based and interface- or limit-based. It mentioned that the third strategy has been used to prevent comment spam. Adam Thomason [12] reviewed the current state of spam in the blogosphere, concluding that anti-spam solutions developed for email are effective in detecting blog spam.
Gordon et al. [13] considered the problem of content-based spam filtering as it arises in three contexts, blog comments among them. Their experiments evaluate the effectiveness of state-of-the-art email spam filters without modification; further experiments investigate the effect of lexical feature expansion. Detection of Harassment on Web 2.0 [14] employs content features, sentiment features and contextual features of documents, using a supervised learning approach to identify online harassment, including comment spam.
We believe that the comment features we propose in this paper will benefit other researchers in this field. We supply a simple but practical method to filter comment spam, and we hope our work will throw new light on comment spam studies.
3 Comment Spam
Normally, comments cannot exist independently, as they are attached to the body of a blog post. Some related concepts need to be introduced in order to define comment spam.
Definition 1. Spam Info. If comment s contains publicity information or advertisements, which generally means hyperlinks, e-mail addresses, phone numbers, MSN accounts and so on, we call it Spam Info, and use Spam Info(s) to measure it.
Definition 2. Correlation. If comment c discusses something about page p, then the correlation between c and p is denoted Rel(C, P). Generally speaking, Rel(C, P) can be indicated by similarity, that is, Rel(C, P) = Sim(C, P).
We use the two methods below to compute the similarity between c and p.
Cosine Similarity. Given the term-frequency vectors C and P of comment c and page p, the cosine similarity Sim(C, P) is expressed using a dot product and magnitudes:
$$\mathrm{Rel}(C, P) = \frac{C \cdot P}{|C| \cdot |P|} \qquad (1)$$
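Equation (1) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name is hypothetical, and it assumes the comment and the page have already been split into word tokens (for the Chinese blog sites studied here, that would require a word segmenter, which is not shown).

```python
import math
from collections import Counter

def cosine_similarity(comment_tokens, page_tokens):
    """Cosine similarity of the term-frequency vectors of two token lists."""
    c = Counter(comment_tokens)   # term-frequency vector C
    p = Counter(page_tokens)      # term-frequency vector P
    # Dot product over the words the two texts share.
    dot = sum(c[w] * p[w] for w in c.keys() & p.keys())
    norm_c = math.sqrt(sum(v * v for v in c.values()))
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    if norm_c == 0 or norm_p == 0:
        return 0.0  # assumed convention for an empty text
    return dot / (norm_c * norm_p)
```

A comment that shares no word with the page scores 0, and a comment with exactly the same word counts as the page scores 1.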
KL-Divergence. KL-divergence was first introduced by Kullback and Leibler in 1951 [15]. In information theory, it is used to indicate the divergence between two distributions. For probability distributions C and P of a discrete random variable, the KL-divergence of P from C is defined to be:
$$D_{KL}(C \,\|\, P) = \sum_{i} C(i) \log \frac{C(i)}{P(i) + \varepsilon} \qquad (2)$$
Rel(C, P) is then calculated as:
$$\mathrm{Rel}(C, P) = 1 - D_{KL}(C \,\|\, P) \qquad (3)$$
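Equations (2) and (3) could be computed as in the sketch below. Several choices here are assumptions, not details given in the paper: C and P are taken as relative word frequencies, the logarithm is natural, the texts are pre-tokenized lists, the function names are hypothetical, and eps = 0.1 is only an illustrative value inside the stated range (0, 0.5).

```python
import math
from collections import Counter

def kl_divergence(comment_tokens, body_tokens, eps=0.1):
    """Smoothed KL-divergence of equation (2); eps must be greater than 0."""
    c_counts = Counter(comment_tokens)
    p_counts = Counter(body_tokens)
    c_total = sum(c_counts.values())
    p_total = sum(p_counts.values())
    divergence = 0.0
    for word, count in c_counts.items():
        c_i = count / c_total                               # C(i)
        p_i = p_counts[word] / p_total if p_total else 0.0  # P(i), 0 if unseen
        divergence += c_i * math.log(c_i / (p_i + eps))
    return divergence

def relevance(comment_tokens, body_tokens, eps=0.1):
    """Rel(C, P) = 1 - D_KL(C || P), as in equation (3)."""
    return 1.0 - kl_divergence(comment_tokens, body_tokens, eps)
```

Because eps is added to every denominator, the smoothed divergence can be slightly negative for a comment whose word distribution matches the body closely.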
Unlike cosine similarity, KL-divergence satisfies neither symmetry nor the triangle inequality. C and P are two different distributions of the random variable χ, and KL-divergence measures the difference between them. In our problem, we would like to know how far each comment is from the blog text, so χ ranges over the words in each comment. In (2), χ is indexed by i, and C(i) and P(i) represent the number of times word i appears in the corresponding text. Note that C and P in (3) are not the term-frequency vectors used in (1): C is the distribution of the comment, and P is the probability distribution of the blog body. As not every word in a comment appears in the blog body, we add a small constant to avoid a zero denominator. In (2) this constant is ε, which in our experiments is greater than 0 and less than 0.5. In practice, KL-divergence proves to be a good way to compute relevance in our model. Thus, with Definitions 1 and 2, we have the definition below.
Definition 3. Comment Spam. If comment c satisfies Spam Info(c) > …, Rel(C, P) < … for given thresholds, then c is considered to be comment spam.
4 Content-based Spam Detection
Bhattarai et al. [16] analyzed several features of spam content. Building on their work, we randomly selected 2,646 blogs as the training set and labeled the data manually, obtaining 277 instances of spam. According to the feature distributions in this set, we propose several features that distinguish normal comments from comment spam. The features used in our model are discussed here.
4.1 The Length of the Comment
Length is an obvious feature of a comment, so we investigated whether it is a good indicator of spam. To this end, we plotted the length distribution of the comments; the result is shown in Fig. 1. The figure consists of a bar graph and a line graph. The bar graph depicts the distribution of comment lengths over a set of intervals for all comments in our training set. The horizontal axis depicts a set of value ranges. The left scale of the vertical axis applies to the bar graph, and depicts the number of comments in the training set that fall into a particular range. The right scale of the vertical axis applies to the line graph, and depicts the percentage of sampled comments in each range that were judged to be spam.
As can be observed in Fig. 1, short comments account for a high percentage of all comments. In general, as comment length increases, the number of comments declines while the likelihood of spam rises. This supports the intuition that in the real world normal comments are usually short and to the point, while spammers repeat their propaganda or discuss in detail a topic unrelated to the text. Obviously, this feature alone cannot identify comment spam. However, we can identify spam by combining it with other features: for example, a long comment with low relevance to the blog body is more likely to be spam than a short one.
4.2 Similarity
A regular comment is usually relevant to the body of the blog, except for some short ones. Text similarity therefore helps to filter comment spam. However, short normal comments may have low cosine similarity and long comment spam may have high cosine similarity, which is the major drawback of detecting comment spam by text similarity. Like Fig. 1, Fig. 2 has a bar graph and a line graph. Comments cluster in the area of low similarity. A commentator usually does not write long comments, nor comment on a blog by copying the original text of the blog content; everyone uses their own words to express an opinion. As text similarity is a statistical measure, the similarity between a comment and the blog text is therefore not very high. When similarity is greater than 0.28, the number of comments starts to shrink and comment spam vanishes.
Fig. 1. Relationship between comment spam and comment length.
Fig. 2. Relationship between comment spam and similarity.
4.3 KL Divergence
KL-divergence, a statistical measure that emphasizes the difference between texts, is effective for exploring text difference. Short comments may have low similarity but comparatively low divergence. Long comment spam may have high similarity but relatively high divergence. That is why we use this feature as a complement to cosine similarity. Compared with Fig. 2, Fig. 3 shows comments clustering in the area of low divergence, indicating that comments related to the blog content account for the majority. When divergence becomes high, the likelihood of comment spam goes up. Still, we cannot identify comment spam by divergence alone.
Fig. 3. Relationship between comment spam and KL-Divergence.
4.4 Popular Words Ratio and Propaganda
Generally speaking, comment spam contains propagandistic information to promote the commentator's website or business. So whether a comment contains a URL, phone number, e-mail address, MSN number or the like is a good cue for spam. Fig. 4 shows the distribution of comments by propaganda. Some short comments, such as "bravo", "marvelous" and other common comment terms, though meaningless, should not be classified as comment spam. Hence, we collect a glossary of common words and check the popular words ratio of a comment to identify such short, low-correlation comments. The popular words ratio measures the proportion of these words in a comment. Fig. 5 shows the distribution of comments by popular words ratio.
Clearly, when a comment contains a lot of propaganda, it must be comment spam. Normal comments contain little or no propaganda. Fig. 4 indicates that propaganda is a highly discriminative feature. From the distribution of the popular words ratio, we notice that a comment containing only popular words is not considered spam. Spam concentrates in the area where the popular words ratio is low. One kind of comment spam deserves discussion here: from Fig. 5, we find that there is still comment spam even when the ratio is greater than 0.4. Some spammers use popular words to disguise their comments and place the propagandistic information behind these words. This fact underscores that we should combine the features mentioned above to identify comment spam.
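The two features of this section could be extracted roughly as follows. This is a sketch under stated assumptions: the regular expressions, the four-word glossary and the function names are all hypothetical stand-ins, since the paper does not list its patterns or its popular words. MSN accounts are assumed to look like e-mail addresses, and a phone number is assumed to be any run of seven or more digits.

```python
import re

# Hypothetical patterns; the paper does not specify how propaganda is matched.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\b\d{7,}\b")

# Hypothetical glossary standing in for the collected popular words.
POPULAR_WORDS = {"bravo", "marvelous", "great", "nice"}

def propaganda_count(text):
    """Count URLs, e-mail-like addresses and phone-like digit runs."""
    count = 0
    for pattern in (URL_RE, EMAIL_RE, PHONE_RE):
        # Blank out each match so a later pattern cannot count it again.
        text, n = pattern.subn(" ", text)
        count += n
    return count

def popular_words_ratio(tokens):
    """Share of tokens that belong to the popular-word glossary."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t.lower() in POPULAR_WORDS) / len(tokens)
```

A short comment such as "bravo" then has a propaganda count of 0 and a popular words ratio of 1, which is the case the glossary is meant to protect from being flagged.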
Fig. 4. Distribution of Propagandas
Fig. 5. Distribution of Popular words Ratio
5 Using Classifiers to Combine Heuristics
How can we identify comment spam with the features mentioned above? No feature should be considered alone; it is the combination of these features that serves our purpose. Since our purpose is to classify comments, we should take classification algorithms into account. We build three simple models on the training set to see which method is better. The results are shown in Table 1.
Table 1. Classification Model Comparison
Models                 Precision   Recall
SVM                    67.1%       30.7%
Naive Bayes            70.9%       46.0%
Decision Tree (C4.5)   84.7%       44.9%
Because their parameters were not tuned, these models do not exhibit their best performance, especially in recall. However, we can see that the decision tree model performs better, and thus it is selected and enhanced as our final solution. A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences [17]. Identifying comment spam is in fact a decision process. Each feature can help decide whether a comment is spam: with the decision of one feature we get a new classification of the comments, and once all features have made their decisions we obtain the final result of whether a comment is classified as spam.
This encourages us to employ a heuristic decision method to build a tree model similar to a decision tree. In simple terms, after we get the features of each comment, we apply a statistical method to measure each single feature's ability to classify the comments. The best feature is chosen as the root of the tree. Every possible decision at the root is then treated as a child node, and the corresponding comments are placed under the proper branch. The process is repeated, choosing the current best feature for the comment set at each branch node.
The best feature should have high discriminative power. Because the behavior of the majority is credible, when a comment's value for a feature is close to the overall level, the comment is ranked as normal; otherwise further decisions are needed. We examine the skewness coefficient of the distribution of each feature over all comments to select the best feature. A symmetrical distribution implies lower discriminability than a long-tailed distribution. Thus, the bigger the absolute value of the skewness coefficient, the better the feature.
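The feature-selection rule can be illustrated as below. The exact definition of the skewness coefficient is not given in this text, so the sketch assumes the standard moment coefficient of skewness (third central moment divided by the 1.5 power of the second); the function names and the sample values are hypothetical.

```python
def skewness(values):
    """Moment coefficient of skewness, m3 / m2**1.5 (assumed definition)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # constant feature: no asymmetry to measure
    return m3 / m2 ** 1.5

def best_feature(feature_values):
    """Pick the feature whose distribution has the largest |skewness|.

    feature_values maps a feature name to its values over all comments.
    """
    return max(feature_values, key=lambda name: abs(skewness(feature_values[name])))

# Illustrative data only: a symmetric feature versus a long-tailed one.
example = {
    "similarity": [1, 2, 3, 4, 5],
    "propaganda": [0, 0, 0, 0, 10],
}
```

With this example data, `best_feature(example)` selects the long-tailed "propaganda" feature, since the symmetric one has zero skewness.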
We may take the mean value or the mode of the distribution as a threshold. Note that real blog sites differ greatly in style and participants. Strictly speaking, therefore, neither the best features nor the thresholds are the same for different blog sites; in other words, a uniform model cannot produce precise results.
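As a small illustration of per-site thresholds, the sketch below derives one threshold per site from either the mean or the mode of a feature's values. The site names and numbers are made up for the example, and the function name is hypothetical; any further adjustment of the threshold beyond the mean or mode is not modeled.

```python
from statistics import mean, mode

def site_thresholds(values_by_site, use_mode=False):
    """Return one threshold per site: the mean (default) or the mode."""
    pick = mode if use_mode else mean
    return {site: pick(values) for site, values in values_by_site.items()}

# Illustrative comment lengths for two hypothetical sites.
lengths = {
    "site_a": [10, 10, 40],
    "site_b": [5, 5, 5, 25],
}
```

Here `site_thresholds(lengths)` gives 20 for site_a and 10 for site_b, while `site_thresholds(lengths, use_mode=True)` gives 10 and 5, showing how the same rule yields different cut-offs on different sites.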
Fig. 6 shows a typical shape of the core part of our model. In Fig. 6 the best feature is propaganda, as containing propaganda indicates a high likelihood of comment spam. Next, the similarity and divergence of propagandistic comments are computed to minimize misjudgment. Comments without propagandistic information are divided into three classes. Short comments with a low popular words ratio are considered to be comment spam. For long comments, we use only divergence to decide whether they are comment spam, to avoid the noise of the long text. Threshold selection is discussed in the experimental section.
Fig. 6. Core Part of the heuristic model.
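The branch structure described for Fig. 6 might look like the following sketch. It is not the authors' model: every threshold value is a placeholder, the way similarity and divergence are combined in the propaganda branch is an assumption, and the third class of comments without propaganda is assumed to be medium-length comments that are kept as normal. All names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Features:
    length: int           # comment length (unit left open, as in the text)
    propaganda: int       # number of propagandistic items found
    similarity: float     # cosine similarity to the blog body
    divergence: float     # KL-divergence from the blog body
    popular_ratio: float  # popular words ratio

@dataclass(frozen=True)
class Thresholds:
    # Placeholder values; the paper derives thresholds per site.
    short_len: int = 10
    long_len: int = 100
    min_similarity: float = 0.05
    max_divergence: float = 1.0
    min_popular_ratio: float = 0.4

DEFAULT_THRESHOLDS = Thresholds()

def is_spam(f, t=DEFAULT_THRESHOLDS):
    if f.propaganda > 0:
        # Root split: propagandistic comments are checked for relevance.
        # Assumption: either signal of irrelevance marks the comment as spam.
        return f.similarity < t.min_similarity or f.divergence > t.max_divergence
    if f.length <= t.short_len:
        # Short comments: spam only if they are not mostly popular words.
        return f.popular_ratio < t.min_popular_ratio
    if f.length >= t.long_len:
        # Long comments: divergence alone decides.
        return f.divergence > t.max_divergence
    # Assumption: remaining medium-length comments are treated as normal.
    return False
```

For example, a five-character comment made up entirely of popular words is kept, while a long comment with no propaganda but a divergence above the placeholder limit is flagged.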
6 Experiments and Results
6.1 Data Collection
Nothing can be accomplished without the necessary means; however, large benchmark data sets of labeled blog comment spam do not yet exist [8]. Thus, we have to run our experiments on publicly available blog sites.
Fig. 7 displays how we obtain our data set. Our data mainly comes from popular blog websites, namely Sohu, Sina and Baidu, from which we get many representative comments. We select some seed sites and, for each seed site, apply a fixed URL pattern to obtain the URL list of blogs. Finally, from the crawled pages we keep the blogs whose body length is greater than 500 and whose number of comments exceeds 25. The crawled data is described in Table 2. The crawled data is our testing set, which is different from the training set mentioned in Sections 4 and 5. As hand-crafted training data is lacking, we had to label the raw corpus manually to form our relatively small-scale training set.
Fig. 7. Data Collection
Table 2. Statistical Data of Comments
Blog site   Sum of articles   Avg of article length   Sum of comments   Avg of comments
Baidu       298               545.3                   11,563            38.8
Sina        359               845.3                   10,518            29.3
Sohu        583               784.2                   31,908            54.7
Total       1240              741.3                   53,989            43.5
6.2 Distribution of Comment Feature
Fig. 8 shows the distribution of comment length on each site. Assuming that most users post valid comments, we conclude that the likelihood of a comment being spam increases with its distance from the center, which is represented by the threshold. Our threshold therefore exceeds the mean value and the mode within a certain range. From Fig. 9, we can see that the distribution of propaganda is a discrete long-tail distribution. The popular words ratio focuses on short comments; we collected 20 words from the blog comments as popular words. Fig. 10 shows the distribution of this feature. A number of comments contain popular words, so this feature helps us differentiate meaningless comments from comment spam. Finally, we analyze the similarity and divergence between the comment and the blog body. These two measures depict the relationship between the comment and the body accurately. Fig. 11 and Fig. 12 show these two distributions. We find that the distributions are consistent across sites. According to the distribution, we can define the similarity threshold for our model. Divergence treats the comment and the body as two language models. The divergence values are distributed evenly and mainly concentrate in the interval [-1, 0].
Fig. 8. Distribution of Comments Length on each site
Fig. 9. Distribution of Propagandas
Fig. 10. Distribution of Popular words Ratio
Fig. 11. Distribution of Similarity
Fig. 12. Distribution of KL Divergence
6.3 Results and Evaluation
After analyzing the distribution of each feature, we built our heuristic model. Processing the blog comments with our model gives the results in Table 3. In general, comment spam makes up about 20% of all comments. In our data, Sohu blogs mostly cover current affairs, which attract more attention; as a result, the Sohu site has a high proportion of comment spam.
Table 3. Statistical Data of Comments
Blog site   Sum of articles   Sum of comments   Sum of comment spam   Proportion
Baidu       298               11,563            1,108                 9.58%
Sina        359               10,518            3,354                 31.88%
Sohu        583               31,908            5,852                 18.34%
Total       1240              53,989            10,314                19.1%
To evaluate our model, we compute the precision and recall of the results. Precision is an important performance index of a model; it indicates the effectiveness of our method and the credibility of our model. Precision P is calculated as:
$$P = \frac{C_{actual}}{C_{find}} \qquad (4)$$
$C_{actual}$ is the amount of actual comment spam in our result. $C_{find}$ is the amount of comment spam that our model has found.
Recall evaluates the identification ability of our model. While ensuring precision, a higher recall shows a greater ability of our model to identify comment spam, so enhancing recall is also an important part of our experiment. Recall R is calculated as:
$$R = \frac{C_{actual}}{C_{total}} \qquad (5)$$
$C_{total}$ is the amount of comment spam in our corpus.
Table 4. Statistical Data of Comments
Blog site   Precision             Recall
Baidu       92.6% (463/500)       86.4% (432/500)
Sina        91.4% (457/500)       83.0% (415/500)
Sohu        87.2% (436/500)       84.2% (421/500)
Total       90.4% (1,356/1500)    84.5% (1,268/1500)
We apply sample annotation to reduce the workload. For precision, we randomly select 500 comments from the results and then mark comment spam manually according to some measures. From the number of marked comments, we get the precision. To get the recall, we run our heuristic model on a total of 500 spam comments that have been marked in the whole comment set and count how many it extracts. With these methods, we get the precision and recall for the three different sites. From Table 4 we can see that the precision of our model is basically satisfactory. However, comment spam takes various forms of expression, which lowers the recall of our model, suggesting that finding more comment spam should be the focus of our future work.
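The sampled precision and recall figures in Table 4 follow directly from equations (4) and (5), as this short check shows. The counts are copied from Table 4; the variable and function names are only illustrative.

```python
SAMPLE_SIZE = 500  # comments sampled per site for each measure

# Correctly identified spam per site, taken from Table 4.
PRECISION_HITS = {"Baidu": 463, "Sina": 457, "Sohu": 436}
RECALL_HITS = {"Baidu": 432, "Sina": 415, "Sohu": 421}

def rates(hits, sample_size=SAMPLE_SIZE):
    """Per-site rate and pooled rate over all sampled comments."""
    per_site = {site: n / sample_size for site, n in hits.items()}
    overall = sum(hits.values()) / (sample_size * len(hits))
    return per_site, overall

precision_by_site, precision_total = rates(PRECISION_HITS)
recall_by_site, recall_total = rates(RECALL_HITS)
```

Pooling the three samples gives 1,356/1,500 for precision and 1,268/1,500 for recall, which round to the 90.4% and 84.5% reported in the Total row.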
7 Conclusions
This research deals with comment spam. We first defined comment spam, then proposed several features and analyzed these features of comments with statistical methods. The statistics show that comment spam can be filtered out by combining these features. Experiments show that our heuristic model can find comment spam effectively, with high precision and recall. As far as we know, research on comment spam is still a novel topic at present. As a significant attempt, our method has achieved satisfactory results. In the future, we should try to build the decision model more reasonably to enhance recall. In addition, we could apply more knowledge and technology from other fields, such as Natural Language Processing, to combat comment spam more aggressively.