Web Spam Detection Using MapReduce
Approach to Collective Classification
Wojciech Indyk1 , Tomasz Kajdanowicz1 , Przemyslaw Kazienko1 , and Slawomir
Plamowski1
Wroclaw University of Technology, Wroclaw, Poland
Faculty of Computer Science and Management
{wojciech.indyk,tomasz.kajdanowicz,kazienko,slawomir.plamowski}@pwr.wroc.pl
Abstract. The paper addresses the problem of web spam detection. Based on interconnected spam and no-spam hosts, a collective classification approach using label propagation is aimed at discovering spam hosts. Each host is represented as a network node and links between hosts constitute the network's edges. The proposed method provides reasonable results and is able to process large data sets, as it is formulated in the MapReduce programming model.1
Keywords: MapReduce, collective classification, classification in networks, label propagation, web spam detection
1 Introduction
Recently, networks have become one of the most commonly used models for representing relations among objects. A natural example of a networked structure is the set of hosts interconnected by hyperlinks placed in web pages deployed on those hosts. Utilizing information about the network's structure, it is possible to classify nodes: based on a partial labelling of the nodes in the network, the labels of the remaining nodes can be discovered.
Nodes in networks may be classified either by inference based on the known profiles of these nodes (the regular concept of classification based on node attributes) or based on relational information derived from the network. The second approach utilizes information about connections between nodes (the structure of the network) and can be very useful in assigning labels to the nodes being classified. For example, it is very likely that a given web page x is related to sport (label sport) if x is linked by many other web pages about sport.
Hence, a form of collective classification should be provided, with simultaneous decision making on every node's label rather than classifying each node separately. Such an approach allows taking into account correlations between connected nodes, which deliver usually undervalued knowledge.
1 This is not the final version of this paper. You can find the final version on the publisher's web page.
Moreover, the growing trend of data explosion in transactional systems requires more sophisticated methods in order to analyse enormous amounts of data. There is a strong need to process big data in parallel, especially in complex analyses like collective classification.
A MapReduce approach to collective classification that is able to process huge data sets is proposed and examined in this paper in order to deal with spam data. The collective classification algorithm reused in this paper was introduced in [1] in the domain of telecommunication customers' classification. In this paper we examine its suitability for web spam detection.
Section 2 covers related work, while Section 3 presents a proposal of a MapReduce approach to label propagation in the network that is able to perform classification of spam hosts. Section 4 contains a description of the experimental setup and the obtained results. The paper is concluded in Section 5.
2 Related Work
Web spam detection problems may be solved using numerous approaches. Previous work on web spam detection focuses on two distinct subproblems, namely (i) content spam detection (originating from email spam detection) [2] and (ii) link spam detection [3, 4, 5].
2.1 Content spam
Content spam relies on creating web pages containing keywords that are generic and more related to most queries than to the actual page content. Such malicious behaviour affects the outcome of ranking algorithms used by search engines, such as PageRank [6]. To overcome this problem, traditional classifiers have been applied [2]. On the other hand, a more sophisticated solution using language model disagreement was introduced in [7].
2.2 Link spam
The latter issue also affects the outcome of web ranking algorithms. According to [8], link spamming refers to 'manipulation of the link structure by a group of users with the intent of improving the rating of one or more users in the group'. Several solutions have been suggested to overcome the issue. For some of them, propagation of some kind of score underlies the improvement of search engine performance, while others rely on machine learning concepts. In [9] a few such techniques have been gathered. Among the most common, TrustRank [10, 11], BadRank [12] and SpamRank [13] can be distinguished. A more sophisticated approach is introduced in [14], where propagation of trust and distrust through web links is used to identify the potential synergistic gains for pages interconnected in a spam farm. In [9] quite a different approach is presented: in order to improve search engine performance, noisy links are removed at the site level instead of the single page level. Another solution relies on statistical analysis using
such properties as linkage structure, page content and page evolution [15]. This method rests on the assumption that certain classes of spam pages diverge in some of their properties from the remaining web pages. Among supervised learning methods, classification is the most obvious one. Several different classifiers have already been tested in the domain of web spam detection. In [16] simple binary trees have been applied, while [17] used SVM models based on local and link-based features.
2.3 Joining link-based and content-based features
Utilizing both content-based and link-based features facilitates thorough web analysis [18]. Traditional pattern recognition methods assume independence between objects/records/instances. In the case of web spam detection, however, this is not necessarily true, as there exist dependencies among web pages and web hosts. Evidently, links between pages and hosts are not distributed randomly. Similar and dissimilar pages are often linked together; nonetheless, content resemblance between linked pages is a frequent situation according to [5].
Moreover, using the topology of the web graph by exploiting dependencies between pages boosts the inference properties of an intelligent system. It has been shown that linked pages or hosts tend to share the same class label, provided that there exists a considerable number of such connections [18]. Intuitively, spam hosts are likely to be linked with other spam hosts, and no-spam hosts with other no-spam hosts.
Another issue that has to be addressed in order to obtain an efficient and scalable inference system is the size of the analysed web graph. As the Internet has grown to an unprecedented scale, processing each page separately has become unsatisfactory. To overcome this obstacle, web page analysis has been replaced by host analysis [18, 9]. Such a solution is based on the assumption that spam and no-spam pages originate from the same web hosts.
Incorporating web graph topology into the inference system may be done in several manners, as described in [18]. Possible approaches embrace the following methods: (i) clustering the host graph, (ii) label propagation to neighbouring hosts, which reinforces the inference properties of ensemble classifiers, and (iii) using neighbouring labels as new features describing each node.
3 Collective Classification by Means of Label Propagation Using MapReduce
The most common way to utilize the information of labelled and unlabelled data is to construct a graph from the data and perform a Markov random walk on it. The idea of a Markov random walk has been used multiple times, e.g. [19, 20, 21], and involves defining a probability distribution over the labels for each node in the graph. In the case of labelled nodes, the distribution reflects the true labels. The aim is then to recover this distribution for the unlabelled nodes. The Label Propagation approach allows performing classification based on relational data.
Let G(V, E, W) denote a graph with vertices V, edges E and an n × n edge weight matrix W, where n = |V|. In such a weighted graph, label propagation may be solved by linear equations (1) and (2) [20]:

∀ i, j ∈ V:  Σ_{(i,j)∈E} w_ij · F_i = Σ_{(i,j)∈E} w_ij · F_j      (1)

∀ i ∈ V:  Σ_{c∈classes(i)} F_i(c) = 1,      (2)

where F_i denotes the probability distribution of classes for node i.
Assuming that V_L denotes the labelled vertices and V_U the unlabelled ones, such that V = V_L ∪ V_U, let F_u denote the probability distribution over the labels associated with each vertex u ∈ V. For each node v ∈ V_L, for which F_v is known, a dummy node v′ is inserted such that w_vv′ = 1 and F_v′ = F_v, which resembles the 'clamping' operation discussed in [20]. Let us further assume that V_D denotes the set of dummy nodes. Then, considering the above, the solution to equations (1) and (2) can be obtained using the Iterative Label Propagation algorithm (Algorithm 1).
Algorithm 1 The pseudo code of the Iterative Label Propagation algorithm.
1: repeat
2:   for all v ∈ V do
3:     F_v = Σ_{(u,v)∈E} w_uv · F_u / Σ_{(u,v)∈E} w_uv
4:   end for
5: until convergence
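The loop above can be sketched in plain Python; this is an illustrative rendering under assumed data structures (an in-neighbour adjacency dict and per-node class distributions), including the dummy-node clamping described earlier, not the authors' actual implementation:

```python
def label_propagation(edges, labels, n_iter=10, tol=1e-6):
    """Iterative Label Propagation sketch (cf. Algorithm 1).

    edges  -- dict mapping node v to a list of (u, w_uv) in-neighbours
    labels -- dict mapping each labelled node to its class distribution,
              e.g. {"hostA": {"spam": 1.0, "no-spam": 0.0}}
    """
    # Clamp known labels through dummy nodes of weight 1 (the 'clamping'
    # operation from Section 3): dummy distributions never change.
    F = {}
    for v, dist in labels.items():
        dummy = ("dummy", v)
        F[dummy] = dict(dist)
        edges.setdefault(v, []).append((dummy, 1.0))

    for _ in range(n_iter):
        delta = 0.0
        for v, in_nb in edges.items():
            known = [(u, w) for u, w in in_nb if u in F]
            if not known:
                continue  # no labelled in-neighbour yet, nothing to average
            total_w = sum(w for _, w in known)
            new = {}
            for u, w in known:
                for c, p in F[u].items():
                    new[c] = new.get(c, 0.0) + w * p / total_w
            old = F.get(v, {})
            delta = max(delta, max(abs(new[c] - old.get(c, 0.0)) for c in new))
            F[v] = new
        if delta < tol:
            break  # distributions stopped changing: convergence
    # Drop the artificial dummy nodes from the result.
    return {v: d for v, d in F.items()
            if not (isinstance(v, tuple) and v[0] == "dummy")}
```

With a single labelled host, each iteration pushes the distribution one link further along the influence edges, matching the behaviour discussed in Section 4.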
Inside the loop, each node is processed using only local information, namely its neighbourhood. As far as the web graph is considered, however, sequential evaluation leads to a severe lack of efficiency. Therefore a parallel version of the algorithm is presented in Algorithm 2.
Algorithm 2 The pseudo code of the MapReduce approach to the Iterative Label Propagation algorithm.
map <node; adjacencyList>
1: for all n ∈ adjacencyList do
2:   propagate <n; node.label, n.weight>
3: end for
reduce <n; list(node.label, weight)>
1: propagate <n; Σ(node.label · weight) / Σ weight>
The MapReduce version of the Iterative Label Propagation algorithm comprises two phases. The Map phase fetches all labelled and dummy nodes and propagates their labels to all nodes in the adjacency list, taking into account the edge weights between nodes. During the Reduce phase, a new label for each node with at least one labelled neighbour is calculated. Reducers obtain the new label for a node using the list of labelled neighbours and the relation strength between nodes (weight). The final result, that is a new label for a particular node, is computed as a weighted sum of label probabilities from the neighbourhood.
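The two phases can be illustrated with a plain-Python simulation of one map/reduce round. The key/value layout mirrors Algorithm 2, but the function names and data structures are illustrative assumptions, not the paper's Hadoop code:

```python
from collections import defaultdict

def map_phase(graph, F):
    """Map: every node with a known distribution emits
    (neighbour, (distribution, edge_weight)) pairs."""
    for node, adjacency in graph.items():
        if node not in F:
            continue  # only labelled (or dummy) nodes propagate
        for neighbour, weight in adjacency:
            yield neighbour, (F[node], weight)

def reduce_phase(pairs):
    """Reduce: weighted average of the distributions received per node."""
    grouped = defaultdict(list)
    for node, payload in pairs:  # the shuffle step groups pairs by key
        grouped[node].append(payload)
    new_F = {}
    for node, contribs in grouped.items():
        total_w = sum(w for _, w in contribs)
        dist = defaultdict(float)
        for label_dist, w in contribs:
            for c, p in label_dist.items():
                dist[c] += p * w / total_w
        new_F[node] = dict(dist)
    return new_F
```

One iteration of Algorithm 2 then corresponds to `reduce_phase(map_phase(graph, F))`; in a real Hadoop job the framework performs the grouping-by-key between the two phases.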
4 Experiments and Results
In the experimental studies we aimed at checking the usability of the proposed collective classification method for spam detection. Thus, due to the relational structure of the Internet infrastructure, collective classification has been applied to web spam data. The Webspam-UK2007 dataset2 was chosen for the experiments. The dataset consists of networked data of 114 529 interconnected hosts in the .UK domain. Part of the set was labelled: 344 spam and 5709 no-spam hosts. The remaining hosts serve as a medium for influence propagation. Connections between hosts were represented by links between websites of the hosts. The strengths of links were calculated according to the following equation:
e(a, b) = c(a, b) / Σ_{x∈V} c(a, x),      (3)

where V is the set of hosts, e(a, b) represents the strength of the link from a to b, and c(a, b) is the number of connections from host a to host b.
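A direct Python rendering of this weighting might look as follows; the nested-dict layout for the raw link counts is an assumption for illustration:

```python
def edge_strengths(counts):
    """Normalize raw link counts c(a, b) into strengths e(a, b) as in Eq. 3.

    counts -- nested dict: counts[a][b] = number of links from host a to b
    """
    e = {}
    for a, out in counts.items():
        total = sum(out.values())  # Σ over x of c(a, x)
        e[a] = {b: c / total for b, c in out.items()} if total else {}
    return e
```

By construction the outgoing strengths of every host sum to 1, which is exactly the normalization property discussed next.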
The formula for link strength calculation presented in equation (3) implies that the strengths are normalized, namely

∀ a ∈ V:  Σ_{b:(a,b)∈E} e(a, b) = 1,

which is important for the dummy node influence (see Section 3 and [1]).
For the purpose of the experiments, the original split into training and testing datasets, provided by the dataset owners2, was retained. This enabled a comparison of the experimental results of the proposed method with the winning method of the Web Spam Challenge organized within the Fourth International Workshop on Adversarial Information Retrieval on the Web (AIRWeb 2008). The reference method used for comparison has been published in [22]. The training set consisted of 3776 no-spam and 222 spam hosts, while the test set of 1933 no-spam and 122 spam hosts. According to the node outdegree distributions of the training and test sets presented in Figure 3, we can conclude that the original split into training and testing sets preserved similar class distributions in both sets. During the preprocessing phase, directions of links were reverted due to the meaning
2 Yahoo! Research: "Web Spam Collections". http://barcelona.research.yahoo.net/webspam/datasets/ Crawled by the Laboratory of Web Algorithmics, University of Milan, http://law.dsi.unimi.it/.
Fig. 1. The network's outdegree distribution (I) and indegree distribution (II).
of the influence relation. The influence relation is directed opposite to the link relation, see Figure 2.
Fig. 2. Relation between the influence and link direction: a link from Website 2 (e.g. your homepage) to Website 1 (e.g. Wikipedia) means that Website 1 influences Website 2.
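This preprocessing step amounts to transposing the weighted link graph; a minimal sketch, assuming links are stored as (source, target, weight) triples:

```python
def revert_links(weighted_links):
    """Transpose the graph: a hyperlink (a, b, w) becomes an influence
    edge (b, a, w), since influence flows against the hyperlink."""
    return [(b, a, w) for a, b, w in weighted_links]
```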
As can be noticed, some hosts do not have links to others, i.e. they are separated. Influence cannot be propagated to such hosts, and they do not receive any label from the method. During the experiments it was verified that no hosts in the test set are separated, so that the results of the evaluation were consistent.
Fig. 3. Training (I) and test (II) set outdegree distribution.

During the experiments, the influence propagation accuracy was examined for distinct numbers of algorithm iterations. Each iteration is able to propagate labels one link further from the starting training nodes. In the given data, after just four iterations, all nodes had been labelled.
In the examined data set we can observe a strong skewness in the distribution of spam and no-spam hosts. This caused a difficulty for the proposed propagation algorithm. The gathered results show that the best accuracy is achieved in the very first iteration of the algorithm. According to [18, 9], spam hosts aggregate in close neighbourhoods. Performing more iterations of the propagation algorithm makes the no-spam labels propagate deeper into the network structure and injects undesirable influence into spam agglomerations. Finally, after multiple iterations of label propagation, we observe that no-spam labels overwhelm the structure of the whole network.
However, we tried to overcome the unbalanced data set problem by adjusting the threshold used to obtain the label assignment decision. We examined 5 threshold values (from 0.1 to 0.5). The final class was assigned according to Equation 4:

class(host) = { spam,     if p(host) ≥ threshold
              { no-spam,  if p(host) < threshold,      (4)

where class(x) is the result of the algorithm for host x and p(host) denotes the computed probability of the spam label for a host.
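The decision rule of Equation 4 translates directly into code; the function name and default threshold are illustrative choices for this sketch:

```python
def classify(p_spam, threshold=0.1):
    """Assign the final class from the propagated spam probability (Eq. 4).
    Note the tie goes to 'spam': p_spam == threshold yields 'spam'."""
    return "spam" if p_spam >= threshold else "no-spam"
```

Sweeping `threshold` over the five examined values (0.1 to 0.5) reproduces the experiment's parameterization.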
As we can see in Figure 4, the best F-measure was achieved for a threshold equal to 0.1.
Moreover, another factor may contribute to the difficulties in spam data modelling. The proposed algorithm performs well under a 'power law' distribution of node degrees. Unfortunately, the indegree distribution of the influence relation network conforms to the Poisson distribution rather than the power law, see Figure 1.
Fig. 4. (I) F-measure for consecutive thresholds of the probability of the spam label. (II) True negative rate for identical parameters as in (I).
As the dataset was thoroughly evaluated in the above-mentioned Web Spam Challenge, we compared our best results with those of the winners of the competition [22]. The results of the proposed and reference methods are presented in Figure 5, which depicts ROC curves for the first and the fourth iteration of the algorithm.
As can be observed, the local collective classification, performing only a single iteration of propagation, provides results comparable to the best results of the Web Spam Challenge. Moreover, in contrast to the reference method, it does not require the additional description (attributes) of hosts that the reference method needs in order to classify them.
5 Conclusions
A collective classification method based on label propagation was proposed for spam host evaluation. The performed experiments, as well as the data analysis, revealed that the approach can be applied to the spam data problem; however, it requires specific adjustments in configuration.
Further experimentation will consider a comparison of the algorithm's behaviour for various dataset distributions (including numerous domains). Moreover, the model of web spam classification will be improved by utilizing clustering algorithms for spam farm detection.
Acknowledgement
This work was partially supported by The Polish National Center of Science, research projects 2011-2012 and 2011-2014, and by a Fellowship co-financed by The European Union within The European Social Fund. Calculations were carried out at the Wroclaw Centre for Networking and Supercomputing (http://www.wcss.wroc.pl), grant No. 177. The authors are grateful for granting access to the computing infrastructure.

Fig. 5. (I) ROC curve after the first iteration and (II) after the fourth iteration of the proposed algorithm.
References
[1] Indyk, W., Kajdanowicz, T., Kazienko, P., Plamowski, S.: Mapreduce approach
to collective classification. In: ICAISC. Volume 7267 of LNCS. (2012) 656–663
[2] Ntoulas, A., Najork, M., Manasse, M., Fetterly, D.: Detecting spam web pages
through content analysis. In: Proceedings of the 15th international conference on
World Wide Web. WWW ’06, New York, NY, USA, ACM (2006) 83–92
[3] Gyongyi, Z., Garcia-Molina, H.: Web spam taxonomy. Technical Report 2004-25,
Stanford InfoLab (March 2004)
[4] Drost, I., Scheffer, T.: Thwarting the nigritude ultramarine: learning to identify link spam. In: Proceedings of the 16th European Conference on Machine Learning (ECML). (2005) 233–243
[5] Davison, B.D.: Recognizing nepotistic links on the web. In: AAAI-2000 Workshop on Artificial Intelligence for Web Search, AAAI Press (2000) 23–28
[6] Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking:
Bringing order to the web (1999)
[7] Mishne, G.: Blocking blog spam with language model disagreement. In: Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb). (2005)
[8] Zhang, H., Goel, A., Govindan, R., Mason, K., Roy, B.V.: Making eigenvector-based reputation systems robust to collusion. In: Proceedings of the Third Workshop on Web Graphs (WAW). Volume 3243 of Lecture Notes in Computer Science, Springer (October 2004) 92–104
[9] da Costa Carvalho, A.L., Chirita, P.A., de Moura, E.S., Calado, P., Nejdl, W.: Site level noise removal for search engines. In: Proceedings of the International World Wide Web Conference (WWW), ACM Press (2006) 73–82
[10] Gyongyi, Z., Garcia-Molina, H., Pedersen, J.: Combating web spam with TrustRank. In: VLDB, Morgan Kaufmann (2004) 576–587
[11] Wu, B., Goel, V., Davison, B.D.: Topical trustrank: using topicality to combat
web spam (2006)
[12] Wu, B., Davison, B.D.: Identifying link farm spam pages. In: Proceedings of the
14th International World Wide Web Conference, ACM Press (2005) 820–829
[13] Benczur, A.A., Csalogany, K., Sarlos, T., Uher, M.: SpamRank - fully automatic link spam detection. In: Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb). (2005)
[14] Gyongyi, Z., Garcia-Molina, H.: Link spam alliances. In: Proceedings of the 31st International Conference on Very Large Data Bases (VLDB). (2005) 517–528
[15] Fetterly, D., Manasse, M., Najork, M.: Spam, damn spam, and statistics: using statistical analysis to locate spam web pages. In: Proceedings of the 7th
International Workshop on the Web and Databases: colocated with ACM SIGMOD/PODS 2004. WebDB ’04, New York, NY, USA, ACM (2004) 1–6
[16] Becchetti, L., Castillo, C., Donato, D., Leonardi, S., Baeza-Yates, R.: Link-based
characterization and detection of web spam. In: Proceedings of the 2nd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb).
(2006)
[17] Kolari, P., Java, A., Finin, T., Oates, T., Joshi, A.: Detecting spam blogs: A machine learning approach. In: Proceedings of the 21st National Conference on Artificial Intelligence (AAAI). (2006)
[18] Castillo, C., Donato, D., Gionis, A., Murdock, V., Silvestri, F.: Know your neighbors: web spam detection using the web topology. In: Proceedings of the 30th
annual international ACM SIGIR conference on Research and development in
information retrieval. SIGIR ’07, New York, NY, USA, ACM (2007) 423–430
[19] Szummer, M., Jaakkola, T.: Clustering and efficient use of unlabeled examples. In: Proceedings of Neural Information Processing Systems (NIPS). (2001)
[20] Zhu, X., Ghahramani, Z., Lafferty, J.: Semi-supervised learning using gaussian fields and harmonic functions. In: Proceedings of the International Conference on Machine Learning (ICML). (2003)
[21] Azran, A.: The rendezvous algorithm: Multiclass semi-supervised learning with markov random walks. In: Proceedings of the International Conference on Machine Learning (ICML). (2007)
[22] Geng, G., Li, Q., Zhang, X.: Link based small sample learning for web spam
detection. In: Proceedings of the 18th international conference on World wide
web. WWW ’09 (2009) 1185–1186