Comprehensive Literature Review on Machine Learning Structures

Stable web spam detection using features based on lexical items
Source: Computers & Security, 2014, 46:79-93
Authors: M. Luckner, M. Gad, P. Sobkowiak
Speaker: Jia Qing Wang
Date: 2016/12/08
Outline
• Introduction
• Proposed features
• Experiments & Results
• Discussion
• Conclusion
Introduction (1/3)
 The typical Web spam detection scheme:
Extract features (novel high-quality features for web pages: link features, content features, …)
→ Dimensionality reduction (feature selection and feature extraction methods: PCA, LDA, …)
→ Classifier
→ Experiment result
Introduction (2/3)
 The main contributions of this paper
 Create a web spam detector that works over years, using datasets from different years as training and testing sets;
 Select several new features based on lexical items;
 Verify the high influence of the selected new features;
 Improve the accuracy of Web spam detection.
Introduction (3/3)
 Data preprocessing produces three derived documents:
 Visible Text document: contains only the pure text between tags;
 Non-blank Visible Text document: obtained by removing all space characters from the Visible Text document, e.g. "v i a g r a" becomes "viagra";
 Distinct Domains document: the set of unique domain names extracted from the Visible Text documents and the whole of the origin documents.
Proposed features (1/7)
 Commonly used statistics for computing features in Web spam detection
 the average length, maximum length, and standard deviation of the length
 Basic features
 Statistics of links, URLs, domains, and words:
 the number of words in the title of an HTML document
 the number of dots in the document's domain
 the count of IP addresses in the Distinct Domains document
 the rate of compression by the bzip2 algorithm, the entropy of characters, the entropy of words, and the length, for both the origin texts and the visible ones
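The compression-rate and entropy features above can be sketched in Python. This is a minimal illustration of the idea, not the paper's exact implementation:

```python
import bz2
import math
from collections import Counter

def char_entropy(text: str) -> float:
    """Shannon entropy (bits) of the character distribution of the text."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def bzip2_ratio(text: str) -> float:
    """Compressed size / original size; repetitive spam text compresses well."""
    raw = text.encode("utf-8")
    return len(bz2.compress(raw)) / len(raw)

# Repetitive keyword stuffing has low entropy and a low compression ratio.
sample = "buy viagra " * 100
print(round(char_entropy(sample), 3))
print(round(bzip2_ratio(sample), 3))
```

The same two functions can be applied to both the origin text and the visible text to obtain the four features named on the slide.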
Proposed features (2/7)
 Features based on Consonant Clusters (6)
 A Consonant Cluster event was defined as a sequence of three or more consonants (extracted by a regular expression).
 In the Distinct Domains document and the Non-blank Visible Text document, statistics of Consonant Clusters were calculated as new features. (6)
 Detects spam built from invented words, such as PRlCE, PROFlTS, or SATlSFACTlON, where a lowercase 'l' replaces the letter 'I'.
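A sketch of the Consonant Cluster statistics. The paper's exact regular expression is not reproduced on the slide; here any run of three or more Latin consonants (case-insensitive) is treated as a cluster:

```python
import re
import statistics

# Assumed pattern: 3+ consecutive Latin consonants, case-insensitive.
CLUSTER_RE = re.compile(r"[b-df-hj-np-tv-z]{3,}", re.IGNORECASE)

def consonant_cluster_stats(text: str) -> dict:
    """Average, maximum, and standard deviation of cluster lengths."""
    lengths = [len(m) for m in CLUSTER_RE.findall(text)]
    if not lengths:
        return {"avg": 0.0, "max": 0, "std": 0.0}
    return {
        "avg": statistics.mean(lengths),
        "max": max(lengths),
        "std": statistics.pstdev(lengths),
    }

# 'PRlCE' hides a lowercase 'l' in place of 'I', producing the cluster 'PRlC'.
print(consonant_cluster_stats("PRlCE PROFlTS"))
```

Computed over the Distinct Domains document and the Non-blank Visible Text document, these three statistics give the six features counted on the slide.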
Proposed features (3/7)
 Weird Combinations (2)
 Such as: v1agra, p0rn, credit4U, StuffForFree, qwq23ewc
 In the Distinct Domains document: total weird combinations / the number of unique domains
 In the Visible Text document: total weird combinations / the number of all characters
 Detects spam that hides prohibited content.
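The Visible Text ratio can be sketched as follows. The slide does not give the paper's exact pattern, so as an assumption a "weird combination" is taken to be any token mixing letters and digits, matching the examples v1agra, p0rn, and credit4U:

```python
import re

TOKEN_RE = re.compile(r"\w+")

def is_weird(token: str) -> bool:
    # Assumed definition: a token containing both digits and letters.
    return any(c.isdigit() for c in token) and any(c.isalpha() for c in token)

def weird_ratio_visible_text(text: str) -> float:
    """total weird combinations / the number of all characters."""
    weird = sum(1 for t in TOKEN_RE.findall(text) if is_weird(t))
    return weird / len(text) if text else 0.0

print(weird_ratio_visible_text("buy v1agra and p0rn now"))
```

The Distinct Domains feature is analogous, dividing by the number of unique domains instead of the character count.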
Proposed features (4/7)
 Analysis of characters (8)
 In the Visible Text document and the Distinct Domains document (2): non-ASCII characters / the number of all characters
 In the Non-blank Visible Text documents, statistics of all continuous sequences of letters from the Latin alphabet. (3)
 In the Non-blank Visible Text documents, statistics of all continuous sequences of non-Latin symbols. (3)
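The non-ASCII ratio is straightforward to compute; a minimal sketch:

```python
def non_ascii_ratio(text: str) -> float:
    """non-ASCII characters / the number of all characters."""
    if not text:
        return 0.0
    return sum(1 for c in text if ord(c) > 127) / len(text)

# Spammers sometimes swap in accented look-alike characters.
print(round(non_ascii_ratio("free vïàgrá"), 3))
```

The same ratio is computed once for the Visible Text document and once for the Distinct Domains document, giving the two features on the slide.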
Proposed features (5/7)
 Analysis of lexical items (based on words and syllables in the Visible Text document)
 Word items: a continuous sequence of letters that is not prefixed or suffixed with numbers or underscores
 Word Syllable Count feature: the number of continuous sequences of the basic vowel characters. (1)
 The average count of syllables in a word, the maximum count of syllables in a word, and the standard deviation of the Word Syllable Count distribution. (3)
 Sentence Count feature: sentences counted with a regular expression. (1)
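Following the slide's definition, a syllable can be approximated as a continuous run of basic vowel characters; a sketch of the Word Syllable Count statistics:

```python
import re
import statistics

# Assumption: "basic vowel characters" are taken to be a, e, i, o, u.
VOWEL_RUN_RE = re.compile(r"[aeiou]+", re.IGNORECASE)

def syllable_count(word: str) -> int:
    """Number of continuous vowel-character sequences in the word."""
    return len(VOWEL_RUN_RE.findall(word))

def syllable_stats(words: list[str]) -> tuple[float, int, float]:
    """Average, maximum, and standard deviation of the syllable counts.
    Assumes a non-empty word list."""
    counts = [syllable_count(w) for w in words]
    return statistics.mean(counts), max(counts), statistics.pstdev(counts)

print([syllable_count(w) for w in ["viagra", "satisfaction", "free"]])
print(syllable_stats(["viagra", "satisfaction", "free"]))
```

This vowel-run heuristic undercounts words like "free" (one run "ee"), but it is cheap and language-independent, which suits large-scale feature extraction.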
Proposed features (6/7)
 Gunning Fog Index: 0.4 × (words per sentence + 100 × complex words / words), where a complex word is a word whose syllable count is greater than 2.
 It can be useful for detecting spam created by Internet bots or by persons with a limited vocabulary.
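A sketch of the Gunning Fog Index using the same vowel-run syllable approximation (the paper's exact tokenization is an assumption here):

```python
import re

VOWEL_RUN_RE = re.compile(r"[aeiou]+", re.IGNORECASE)
WORD_RE = re.compile(r"[A-Za-z]+")

def syllables(word: str) -> int:
    return len(VOWEL_RUN_RE.findall(word))

def gunning_fog(text: str) -> float:
    """0.4 * (words/sentences + 100 * complex_words/words),
    where a complex word has a syllable count greater than 2."""
    words = WORD_RE.findall(text)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    complex_words = sum(1 for w in words if syllables(w) > 2)
    return 0.4 * (len(words) / sentences + 100 * complex_words / len(words))

# Short, simple-worded bot text scores a very low index.
print(round(gunning_fog("Buy now. Cheap pills here. Act fast."), 2))
```

A low index flags the short, repetitive sentences typical of bot-generated spam.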
Proposed features (7/7)
 Significance of features (22 new features)
Experiments & Results (1/7)
Test the stability of dataset usage; try to find a web spam detector that works over years.
 Datasets: two datasets, WEBSPAM-UK2006 and WEBSPAM-UK2007, used interchangeably as the learning set and the testing set.
 Classifier: a modified SVM, where f(x) is the distance to the decision line and p(x) is the calculated probability of correct classification for the point x.
 Evaluation measure: AUC, the area under the ROC curve.
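The slide names f(x) and p(x) but not the mapping between them. As an assumption, a Platt-style logistic mapping from the decision value to a probability is sketched below, together with a pairwise-rank computation of the AUC:

```python
import math

def platt_probability(f_x: float, a: float = -1.0, b: float = 0.0) -> float:
    """Assumed Platt-style mapping of the SVM decision value f(x) to a
    probability p(x); the paper's exact formula is not given on the slide."""
    return 1.0 / (1.0 + math.exp(a * f_x + b))

def auc(scores: list[float], labels: list[int]) -> float:
    """Area under the ROC curve via pairwise comparison of spam (1) vs. ham (0)
    scores; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

probs = [platt_probability(f) for f in [2.1, 0.3, -0.5, -1.7]]
print(auc(probs, [1, 1, 0, 0]))  # perfectly separated toy split: AUC 1.0
```

The pairwise formulation makes explicit why AUC is insensitive to any monotone rescaling such as the f(x) → p(x) mapping.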
Experiments & Results (2/7)
Experiments & Results (3/7)
When the test data come from the same year, the AUCs of the 2006 sets are higher than those of the 2007 sets. When the test data come from a different year, stability is better for specificity than for sensitivity, but still worse than for accuracy.
Experiments & Results (4/7)
Analyze the influence of the new features
Experiments &Results (5/7)
 To prove that the difference was significant, we performed Wilcoxon's SignedRank test for paired scores.
 One method of Hypothesis Testing.
 Reject the null hypothesis (p =6.4 x 10-4), at 0.05 level.
 Accept (p=3.4 x 10-2), at the 0.05 level, the hypothesis that the mean difference
between the AUCs is 0.024.
 The full set of features was statistically significantly better for all pairs except
the learning set 2007 I and the testing sets 2006 I and 2006 II.
17
Experiments & Results (6/7)
Analysis of stability
 Split the data from 2006 into 30 random subsets, trained on each subset, and tested on all data from both 2006 and 2007.
Experiments & Results (7/7)
Analysis of stability
 Wilcoxon's Signed-Rank test for 30 paired scores.
 The accuracy for 2006 and 2007 was not statistically significantly different (difference is 0.01; stable).
 The specificity is also stable (difference is 0.001).
 The sensitivity: the average difference is 0.35.
 The AUC: the average difference between years is 0.18.
Discussion (1/2)

Methods                                                  UK2006 AUC   UK2007 AUC
Our method                                               0.895        0.745
Qualified link analysis and language models [1]          0.88         0.76
Multilayer Perceptrons and Support Vector Machines [2]   0.80         0.72

Methods                                                               AUC
Our method                                                            0.738
The C4.5 tree classifier trained on the data from 2006
and tested on the data from 2007 [3]                                  0.73

[1] Araujo L, Martinez-Romo J. Web spam detection: new classification features based on qualified link analysis and language models. IEEE Transactions on Information Forensics and Security 2010;5(3):581-90. http://dx.doi.org/10.1109/TIFS.2010.2050767.
[2] Goh KL, Singh A, Lim KH. Multilayer perceptrons neural network based web spam detection application. In: Signal and Information Processing (ChinaSIP), 2013 IEEE China Summit & International Conference on; 2013. p. 636-40. http://dx.doi.org/10.1109/ChinaSIP.2013.6625419.
[3] Erdelyi M, Benczur AA, Masanes J, Siklosi D. Web spam filtering in internet archives. In: Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web, AIRWeb '09. New York, NY, USA: ACM; 2009. p. 17-20. http://dx.doi.org/10.1145/1531914.1531918.
Discussion (2/2)
 The authors used the WEKA toolkit to create random forests (RF), evaluated by 10-fold cross-validation on the WEBSPAM-UK2007 dataset.
 The obtained AUC (0.991) is better than in the discussed works [4][5][6].

[4] Bíró I, Siklósi D, Szabó J, Benczúr AA. Linked latent Dirichlet allocation in web spam filtering. In: Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web, AIRWeb '09. New York, NY, USA: ACM; 2009. p. 37-40. http://dx.doi.org/10.1145/1531914.1531922.
[5] Erdélyi M, Garzó A, Benczúr AA. Web spam classification: a few features worth more. In: Proceedings of the 2011 Joint WICOW/AIRWeb Workshop on Web Quality, WebQuality '11. New York, NY, USA: ACM; 2011. p. 27-34. http://dx.doi.org/10.1145/1964114.1964121.
[6] Dong C, Zhou B. Effectively detecting content spam on the web using topical diversity measures. In: Proceedings of the 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology, WI-IAT '12, vol. 01. Washington, DC, USA: IEEE Computer Society; 2012. p. 266-73.
Conclusion
 This paper has shown that data from WEBSPAM-UK2006 can be used to create classifiers that work stably on both the WEBSPAM-UK2006 and WEBSPAM-UK2007 datasets.
 This paper showed that the proposed new features improved the classification results.
Thanks!