Detecting Comment Spam through Content Analysis

Congrui Huang, Qiancheng Jiang, Yan Zhang

Key Laboratory of Machine Perception, Ministry of Education
School of Electronics Engineering and Computer Science, Peking University, Beijing
[email protected], {jiangqiancheng,zhy}@cis.pku.edu.cn

Abstract. In the Web 2.0 era, individual Internet users can also act as information providers, publishing information or making comments conveniently. However, some participants spread irresponsible remarks or post irrelevant comments for commercial interests. This so-called comment spam severely hurts information quality. This paper tries to detect comment spam automatically through content analysis, using some previously undescribed features. Experiments on a real data set show that our combined heuristics can correctly identify comment spam with high precision (90.4%) and recall (84.5%).

Keywords: Comment Spam, Content Analysis, Comment Features.

1 Introduction

From its original intention as a platform for sharing data, the web has grown to be a central part of cultural, educational and commercial life. On the Internet, people can purchase goods, keep an eye on the people around them and exchange ideas at any time. The principle of human centeredness helped bring about the idea of the social network service, which was first raised by Starr [1]. A social network service provides a platform for people to share their interests, or to explore topics and activities they care about. The use of social networks connects virtual and real life: improving communication, presenting business opportunities and advancing technology. Improving the quality of social networking is therefore well worth considering.

Blogs, as a kind of social networking, have been widely used in recent years. In this paper, we propose a method to improve the quality of blogs. Blog sites rely on user-generated content, which makes them both incredibly dynamic and tempting targets for spam [2].
Spam, a cause of low-quality social networking, not only leaves a bad impression on users but also misleads search engines. Comments, a main part of a blog, are a good indication of the significance of a weblog [3], and most bloggers also identify comment feedback as an important motivation for their writing [4, 5]. In this scenario, comment spam hurts both bloggers and users. Blog publishers provide comments so that thoughts can be easily shared, but once readers are allowed to add comments, spammers gain a shortcut for abusing the comment system. We cannot solve this problem of openness by simply turning off the comment system. Although filtering comment spam manually may be effective, it is exhausting and unrealistic.

Practitioners and researchers have tried to find an efficient and automatic way to combat comment spam, and in the last few years research in this respect has led to good results. In this paper, we propose a heuristic method that builds on our predecessors' work. We noticed that Trevino [5] and Mishne et al. [6] mainly focused on the link spam common in blog comments. Mishne et al. followed a language modelling approach for detecting link spam in blogs and similar pages. KL-divergence, the main method in their solution, is only one part of our work. What is more, link spam is only one of the manifestations of comment spam. Sculley [7] summarized that academic researchers advocate the use of SVMs, while practitioners prefer Bayesian methods for content-based filtering. Sculley also presented his Relaxed Online SVM, which reduces computational cost.

Creating an efficient way to detect comment spam is a challenging problem. We must ensure that we do not mistakenly consider legitimate comments to be spam. In this paper, we propose a fresh approach to filtering comment spam. We extract several features of comments and, by analyzing the combined features, derive rules to filter blog comments.
In detail, we propose several features; according to the distribution of these features, we apply heuristic methods to obtain judgment rules for the blog comments of each web site, with which we check comments step by step. In the end, we obtain a comment spam set and a normal comment set. Unlike traditional methods, we do not apply classification algorithms directly, because what we face is the real and complicated network environment. With the features we propose, we build our own heuristic model, which differs from the classical ones. When we apply our model to popular blog sites, it removes comment spam effectively, showing that our method is feasible.

The remainder of the paper is structured as follows. In Section 2, we discuss related work. In Section 3, we define comment spam. In Sections 4 and 5, we analyze comment features and describe our heuristic methods. In Section 6, we present our experimental framework and the real-world data set that we used.

2 Related Work

Academic research directed at comment spam is relatively rare; however, some existing solutions are worth reviewing. Spammers will reply to their own comments, while normal users will not [8], so comment spam can be reduced by disallowing multiple consecutive submissions. Google advocates the use of rel="nofollow" [9] in order to reduce the effect of heavy inter-blog linking on page ranking. Special software products such as the Math Comment Spam Protection Plugin for WordPress [10] and Movable Type [11] have their own ways of preventing comment spam. A survey [2] showed that the three main anti-spam strategies commonly used in practice are identification-based, rank-based, and interface- or limit-based, and mentioned that the third has been used to prevent comment spam.

Adam Thomason [12] reviewed the current state of spam in the blogosphere, concluding that anti-spam solutions developed for email are effective in detecting blog spam. Also, Gordon et al.
[13] considered the problem of content-based spam filtering as it arises in three contexts, blog comments among them. Their experiments evaluate the effectiveness of state-of-the-art email spam filters, without modification. Further experiments investigate the effect of lexical feature expansion. Detection of Harassment on Web 2.0 [14] employs content features, sentiment features and contextual features of documents, using a supervised learning approach to identify online harassment, including comment spam.

We believe that the comment features we propose in this paper will benefit other researchers in this field. We supply a simple but more practical method for filtering comment spam.

3 Comment Spam

Normally, a comment cannot exist independently, as it is attached to the body of a blog post. Some related concepts must be introduced in order to define comment spam.

Definition 1. Spam Info. If comment s contains publicity information or advertisement, which generally means a hyperlink, e-mail address, phone number, MSN account and so on, we call s Spam Info, and use Spam Info(s) to measure it.

Definition 2. Correlation. If comment c discusses something about page p, then the correlation between c and p is denoted Rel(C, P). Generally speaking, Rel(C, P) can be indicated by similarity, that is, Rel(C, P) = Sim(C, P). We use the two methods below to compute the similarity between c and p.

Cosine Similarity. We take the term-frequency vectors C and P of comment c and page p; the cosine similarity Sim(C, P) is then expressed using a dot product and magnitudes:

$$\mathrm{Rel}(C, P) = \frac{C \cdot P}{|C| \cdot |P|} \tag{1}$$

KL-Divergence. First introduced by Kullback and Leibler in 1951 [15], it has been used in information theory to indicate the divergence between two distributions.
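Returning to equation (1), the cosine similarity between a comment and a blog body can be sketched as follows. This is a minimal illustration, not the authors' implementation: the paper does not describe its tokenizer, so the lower-cased whitespace split in `term_frequencies` is an assumption, and both function names are hypothetical.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Term-frequency vector of a text.

    Assumption: tokens are lower-cased, whitespace-separated words.
    The paper does not say how its texts were tokenized.
    """
    return Counter(text.lower().split())


def cosine_similarity(comment, page):
    """Sim(C, P) = (C . P) / (|C| * |P|), as in equation (1)."""
    c = term_frequencies(comment)
    p = term_frequencies(page)
    # Only terms present in both texts contribute to the dot product.
    dot = sum(c[t] * p[t] for t in c.keys() & p.keys())
    norm_c = math.sqrt(sum(v * v for v in c.values()))
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    if norm_c == 0 or norm_p == 0:
        return 0.0  # an empty text shares nothing with any other text
    return dot / (norm_c * norm_p)
```

With this definition, a comment that shares no term with the blog body scores 0, and a comment identical to the body scores 1.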
For probability distributions C and P of a discrete random variable, the KL-Divergence of P from C is defined to be:

$$D_{KL}(C \,\|\, P) = \sum_i C(i) \log \frac{C(i)}{P(i) + \varepsilon} \tag{2}$$

Rel(C, P) is then calculated as:

$$\mathrm{Rel}(C, P) = 1 - D_{KL}(C \,\|\, P) \tag{3}$$

Unlike cosine similarity, KL-Divergence satisfies neither symmetry nor the triangle inequality. C and P are two different distribution densities of the random variable χ, and KL-Divergence measures the difference between C and P. For our problem, we would like to know how far each comment is from the blog text. χ ranges over the words in each comment. In (2), χ is indexed by i, and C(i) and P(i) represent the number of times the word appears in each corresponding text. Note that C and P in (3) are not the term-frequency vectors used in (1): C is the distribution of each comment, and P is the probability distribution of the blog body. As not every word in a comment will exist in the blog body, we add a small constant ε in (2) to avoid a zero denominator; in our experiment, ε is greater than 0 and less than 0.5. Our results show that KL-Divergence is a good way to compute relevance in our model.

Thus, with Definitions 1 and 2, we have the definition below.

Definition 3. Comment Spam. If Spam Info(c) of comment c exceeds a threshold and Rel(C, P) falls below a threshold, then c is considered to be comment spam.

4 Content-based Spam Detection

Bhattarai et al. [16] analyzed several features of spam content. Building on their work, we randomly selected 2,646 blogs as the training set and labeled the data manually, which yielded 277 spam comments. According to the feature distribution of this set, we propose some features to distinguish normal comments from comment spam. The features used in our model are discussed here.

4.1 The Length of the Comment

Length is an obvious feature of a comment, so we investigated whether length is a good indicator of spam. To this end, we plotted the length distribution of the comments; the result is shown in Fig. 1.
This figure consists of a bar graph and a line graph. The bar graph depicts the distribution of comment length over intervals for all comments in our training set. The horizontal axis depicts a set of value ranges. The left scale of the vertical axis applies to the bar graph, and depicts the number of comments in the training set that fall into a particular range. The right scale applies to the line graph, and depicts the percentage of sampled comments in each range that were judged to be spam.

As can be observed in Fig. 1, short comments account for a high percentage. In general, as comment length increases, the number of comments declines while the probability of spam rises. This supports the intuition that, in the real world, normal comments are usually short and to the point, while spammers repeat their propaganda or discuss in detail a topic unrelated to the text. Obviously, such a feature cannot indicate comment spam alone. However, we can identify spam by combining it with other features; for example, a long comment with low relevance to the blog body is more likely to be spam than a short one.

4.2 Similarity

A regular comment is usually relevant to the body of the blog, except for some short ones. Thus, text similarity helps to filter comment spam. However, short normal comments may have low cosine similarity and long spam comments may have high cosine similarity, which is the major drawback of measuring comment spam by text similarity.

Like Fig. 1, Fig. 2 has a bar graph and a line graph. Comments cluster in the area of low similarity. A commentator usually does not write long comments, nor repeat the original text of the blog; people use their own words to express their opinions. As text similarity is a statistical method, the similarity between a comment and the blog text is therefore seldom very high.
When similarity is greater than 0.28, the number of comments shrinks and comment spam vanishes.

Fig. 1. Relationship between comment spam and comment length.
Fig. 2. Relationship between comment spam and similarity.

4.3 KL Divergence

KL Divergence, a statistical method that emphasizes the difference between texts, is effective in exploring text difference. Short comments may have low similarity but comparatively low divergence. Long comment spam may have high similarity but relatively high divergence. That is why we use this feature as a complement to cosine similarity. Compared with Fig. 2, comments cluster in the area of low divergence, indicating that comments related to the blog content account for the majority. When divergence becomes high, the probability of being comment spam goes up. Still, we cannot identify comment spam by divergence alone.

Fig. 3. Relationship between comment spam and KL-Divergence.

4.4 Popular Words Ratio and Propaganda

Generally speaking, comment spam contains propagandistic information to promote the commentator's website or business. So whether a comment contains a URL, phone number, e-mail address, MSN number or the like is a good cue for spam. Fig. 4 shows the distribution of comments over propaganda. Some short comments, such as "bravo", "marvelous" and other common comment terms, though meaningless, should not be classified as comment spam. Hence, we collect a common glossary and check the popular words ratio of a comment to identify such short, low-correlation comments. The popular words ratio measures the proportion of those glossary words in a comment. Fig. 5 shows the distribution of comments over popular words ratio.

Clearly, when a comment contains a lot of propaganda, it must be comment spam. Normal comments contain little or no propaganda. Fig. 4 indicates that propaganda is a feature with high discriminability.
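The divergence-based relevance of Section 4.3 and the two features of Section 4.4 might be computed along the following lines. This is a hedged sketch under stated assumptions: the value `EPSILON = 0.1` is merely one choice inside the (0, 0.5) range the paper gives, the regular expressions in `PROPAGANDA_PATTERNS` are illustrative guesses at what counts as propaganda, and `POPULAR_WORDS` holds only the two example words from the text because the paper's 20-word glossary is not listed. All function names are hypothetical.

```python
import math
import re
from collections import Counter

EPSILON = 0.1  # assumption: the paper only states 0 < epsilon < 0.5

# Assumption: illustrative patterns, not the authors' actual rules.
PROPAGANDA_PATTERNS = [
    re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE),  # hyperlinks
    re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),          # e-mail or MSN addresses
    re.compile(r"\b\d{7,}\b"),                            # long digit runs such as phone numbers
]

# Assumption: stand-in for the paper's unlisted 20-word glossary.
POPULAR_WORDS = {"bravo", "marvelous"}


def distribution(tokens):
    """Relative frequency of each token (empty input gives an empty dict)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}


def kl_relevance(comment_tokens, page_tokens, eps=EPSILON):
    """Rel(C, P) = 1 - D_KL(C || P), with eps added to P(i) as in the text."""
    c = distribution(comment_tokens)
    p = distribution(page_tokens)
    d_kl = sum(ci * math.log(ci / (p.get(t, 0.0) + eps)) for t, ci in c.items())
    return 1.0 - d_kl


def propaganda_count(text):
    """Number of propaganda-like items (links, addresses, numbers) in a comment."""
    return sum(len(pattern.findall(text)) for pattern in PROPAGANDA_PATTERNS)


def popular_words_ratio(tokens, popular=POPULAR_WORDS):
    """Share of a comment's tokens that belong to the popular-word glossary."""
    if not tokens:
        return 0.0
    return sum(t.lower() in popular for t in tokens) / len(tokens)
```

For instance, `popular_words_ratio("Bravo marvelous".split())` is 1.0, which matches the observation that a comment made only of popular words should not be treated as spam, while a comment whose words never occur in the blog body receives a lower `kl_relevance` than one that echoes the body.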
Through the distribution of popular words ratio, we notice that when a comment contains only popular words it is not considered to be spam. Spam concentrates in the area where the ratio of popular words is low. One kind of comment spam deserves mention here: from Fig. 5, we find that there is still comment spam even when the ratio is greater than 0.4. Some spammers use popular words to disguise their comments, and behind these words they place the propagandistic information. This fact underscores that we should combine the features mentioned above to identify comment spam.

Fig. 4. Distribution of Propagandas
Fig. 5. Distribution of Popular words Ratio

5 Using Classifiers to Combine Heuristics

How can we identify comment spam with the features mentioned above? No feature should be considered alone; it is the combination of these features that serves our purpose. Our purpose is to classify comments, so we take some classification algorithms into account. We build three simple models on the training set to see which method is better. The results are shown in Table 1.

Table 1. Classification Model Comparison

Models                 Precision  Recall
SVM                    67.1%      30.7%
Naive Bayes            70.9%      46.0%
Decision Tree (C4.5)   84.7%      44.9%

Because their parameters were not tuned, these models do not exhibit their best performance, especially in recall. However, we can see that the decision tree model performs better, and thus it is selected and enhanced as our final solution. A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences [17]. Identifying comment spam is in fact a decision process. Each feature helps decide whether a comment is spam; with the decision of one feature we obtain a new classification of the comments, and finally all features make their decisions to obtain the ultimate result of whether a comment is ranked as spam.
We are encouraged to employ a heuristic decision method to build a tree model similar to a decision tree. In simple terms, after we obtain the features of each comment, we apply a statistical method to measure each single feature's ability to classify the comments. The best feature is chosen as the root of the tree. Every possible decision at the root is then treated as a child node, and the corresponding comments are put under the proper branch. The process is repeated, choosing the current best feature for the comment set held in each branch node. The best feature should have high discriminating power. Because mass behavior is credible, when a comment's value for a feature is close to the overall level, it is ranked as a normal comment; otherwise further decisions are needed. We examine the skewness coefficient of the distribution of each feature over all comments to select the best feature. A symmetrical distribution implies lower discriminability than a long-tailed distribution. Thus, the bigger the absolute value of the skewness coefficient, the better the feature. We may take the mean value or the mode of the distribution as a threshold.

Note that blog sites in the real network differ greatly in style and participants. Strictly speaking, neither the best features nor the thresholds are the same for different blog sites; in other words, a uniform model cannot produce precise results.

Fig. 6 shows a typical shape of the core part of our model. In Fig. 6 the best feature is propaganda, as containing propaganda indicates a high potential for comment spam. Next, the similarity and divergence of propagandistic comments are computed to minimize misjudgment. Comments without propagandistic information are divided into three classes. Short comments with a low popular words ratio are considered to be comment spam. For long comments, we use only divergence to decide whether they are comment spam, to avoid the noise of the long text.
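The feature-selection step described above (rank features by the absolute skewness of their distribution, then take the mean as a threshold) can be sketched as follows. The paper does not state which skewness coefficient it uses, so the moment-based coefficient m3 / m2^1.5 here is an assumption, and the function names and toy values are hypothetical.

```python
def skewness(values):
    """Moment coefficient of skewness, m3 / m2**1.5.

    Assumption: the paper does not say which skewness coefficient it uses.
    Expects a non-empty list of numbers.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # a constant feature has no asymmetry
    return m3 / m2 ** 1.5


def rank_features(feature_table):
    """Order feature names from most to least skewed (by absolute value).

    feature_table maps a feature name to its values over all comments.
    """
    return sorted(
        feature_table,
        key=lambda name: abs(skewness(feature_table[name])),
        reverse=True,
    )


def mean_threshold(values):
    """Mean of the distribution, one of the two threshold choices in the text."""
    return sum(values) / len(values)


# Toy values, invented purely for illustration.
toy_features = {
    "similarity": [0.1, 0.2, 0.3],   # symmetric, so skewness is 0
    "propaganda": [0, 0, 0, 10],     # long right tail, so skewness is large
}
best_first = rank_features(toy_features)
```

On the toy data the long-tailed `propaganda` feature ranks first, which is consistent with the role it plays as the root in Fig. 6; on real data the ranking would have to be recomputed for each blog site, as the text notes.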
Threshold selection will be discussed in the experimental section.

Fig. 6. Core Part of the heuristic model.

6 Experiments and Results

6.1 Data Collection

Nothing can be accomplished without the necessary means; however, large benchmark data sets of labeled blog comment spam do not yet exist [8]. Thus, we have to run our experiments on publicly available blog sites. Fig. 7 displays how we obtain our data set. Our data mainly comes from popular blog websites: Sohu, Sina and Baidu, from which we obtain many representative comments. We select some seed sites and, for each seed site, apply a fixed URL pattern to obtain the URL list of blogs. Finally, from the crawled pages we pick out the data for which the length of the blog body is greater than 500 and the number of comments exceeds 25. The crawled data is described in Table 2. The crawled data is our testing set, which is different from the training set mentioned in Sections 4 and 5. Lacking hand-crafted training data, we had to label the raw corpus manually to form our relatively small training set.

Fig. 7. Data Collection

Table 2. Statistical Data of Comments

Blog site  Sum of articles  Avg. article length  Sum of comments  Avg. comments
Baidu      298              545.3                11,563           38.8
Sina       359              845.3                10,518           29.3
Sohu       583              784.2                31,908           54.7
Total      1,240            741.3                53,989           43.5

6.2 Distribution of Comment Feature

Fig. 8 shows the distribution of comment length on each site. Assuming that most users post valid comments, we conclude that the probability of a comment being spam increases with its distance from the center, which is represented by the threshold. So our threshold exceeds the mean value and the mode within a certain range. From Fig. 9, we can see that the distribution of propaganda is a discrete long-tailed distribution. Popular words ratio focuses on short comments; we collected 20 words from the blog comments as popular words. Fig.
10 indicates the distribution of this feature. A number of comments contain popular words. Thus this feature helps us differentiate meaningless comments from comment spam. Finally, we analyze the similarity and divergence between the comment and the blog body. These two aspects depict the relationship between the comment and the body accurately. Fig. 11 and Fig. 12 show these two distributions. We find that the distributions of the various sites are consistent. According to the distribution, we can define the similarity threshold for our model. Divergence treats the comment and the body as two language models. The divergence values are spread evenly and mainly concentrate in the range [-1, 0].

Fig. 8. Distribution of Comments Length on each site
Fig. 9. Distribution of Propagandas
Fig. 10. Distribution of Popular words Ratio
Fig. 11. Distribution of Similarity
Fig. 12. Distribution of KL Divergence

6.3 Results and Evaluation

After analyzing the distribution of each feature, we build our heuristic model. We obtain the results in Table 3 by processing the blog comments with our model. In general, comment spam makes up about 20% of the total comments. In our data set, Sohu blogs mostly cover current affairs, which attract more attention; as a result, the Sohu site has a high proportion of comment spam.

Table 3. Statistical Data of Comments

Blog site  Sum of articles  Sum of comments  Sum of comment spam  Proportion
Baidu      298              11,563           1,108                9.58%
Sina       359              10,518           3,354                31.88%
Sohu       583              31,908           5,852                18.34%
Total      1,240            53,989           10,314               19.1%

To evaluate our model, we compute the precision and recall of the results. Precision is an important performance index of a model; it indicates the effectiveness of our method and the credibility of our model. Precision P is calculated as:

$$P = \frac{C_{actual}}{C_{find}} \tag{4}$$

C_actual is the amount of actual comment spam in our result. C_find is the amount of comment spam that our model has found.
Recall evaluates the identification ability of our model. With precision ensured, higher recall shows the ability of our model to identify comment spam, so enhancing recall is also an important part of our experiment. Recall R is calculated as:

$$R = \frac{C_{actual}}{C_{total}} \tag{5}$$

C_total is the amount of comment spam in our corpus.

Table 4. Statistical Data of Comments

Blog site  Precision              Recall
Baidu      92.6% (463/500)        86.4% (432/500)
Sina       91.4% (457/500)        83.0% (415/500)
Sohu       87.2% (436/500)        84.2% (421/500)
Total      90.4% (1,356/1,500)    84.5% (1,268/1,500)

We apply sample annotation to reduce the workload. For precision, we randomly select 500 comments from the results and then mark comment spam manually according to a set of criteria. From the number of marked comments we obtain the precision. For recall, we apply our heuristic model to a total of 500 spam comments that have been marked in the whole comment set and count how many it extracts. In this way we obtain the precision and recall for the three sites.

From Table 4 we can see that the precision of our model is basically satisfactory. However, comment spam takes various forms of expression, which lowers the recall of our model, suggesting that finding more comment spam should be the focus of our future work.

7 Conclusions

This research deals with comment spam. We first defined comment spam, then proposed some features of comments and analyzed them with statistical methods. The statistics show that comment spam can be filtered out by combining these features. Experiments show that our heuristic model can find comment spam effectively with high precision and recall. As far as we know, research on comment spam is still a novel topic at present. As a significant attempt, our method has achieved satisfactory results. In the future, we will try to build the decision model more reasonably to enhance recall.
On the other hand, we could apply more knowledge and technology from other fields, such as Natural Language Processing, to combat comment spam more aggressively.
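As a small arithmetic check on the evaluation, the pooled precision and recall reported in Table 4 can be recomputed from the per-site sample counts using the ratios of equations (4) and (5). The counts below are copied from Table 4; the variable and function names are hypothetical.

```python
# (hits, sample size) per site, copied from Table 4.
precision_counts = {"Baidu": (463, 500), "Sina": (457, 500), "Sohu": (436, 500)}
recall_counts = {"Baidu": (432, 500), "Sina": (415, 500), "Sohu": (421, 500)}


def pooled_ratio(counts):
    """Sum the hits and the sample sizes over all sites, then divide."""
    hits = sum(h for h, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return hits / total


def per_site_percent(counts):
    """Per-site ratio as a percentage rounded to one decimal place."""
    return {site: round(100 * h / t, 1) for site, (h, t) in counts.items()}


pooled_precision = round(100 * pooled_ratio(precision_counts), 1)
pooled_recall = round(100 * pooled_ratio(recall_counts), 1)
```

The pooled values come out at 90.4 and 84.5 percent, in agreement with the totals of 1,356/1,500 and 1,268/1,500 given in the table.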