Web Spam Detection: New Approach with Hidden Markov Models
Ali Asghar Torabi 1, Kaveh Taghipour 2,*, and Shahram Khadivi 1

1 Human Language Technology Lab, Department of Computer Engineering and IT, Amirkabir University of Technology
  {a.torabi,khadivi}@aut.ac.ir
2 Department of Computer Science, National University of Singapore, 13 Computing Drive, Singapore 117417
  [email protected]
Abstract. Web spam is the result of a number of methods used to deceive search engine algorithms in order to obtain higher ranks in search results. Advanced spammers use keyword and link stuffing methods to create farms of spam pages. Most recent work in the web spam detection literature utilizes graph-based methods to enhance the accuracy of this task. This paper presents a probabilistic approach that uses content- and link-based features to detect web spam pages. Since we observe high connectivity between web spam pages, we adopt a method based on Hidden Markov Models that exploits the conditional dependency within a sequence of hosts and the spam/normal class distribution of each host. Experimental results show that the proposed method significantly improves the performance of the baseline classifier.

Keywords: web spam, link spam, hidden Markov models, ant colony optimization.
1   Introduction
Given the vast amount of information on the web, users have to rely on search engines to locate the web pages relevant to their interests and inquiries. The goal of search engines, as the main information retrieval systems for the web, is to assign higher ranks to the pages that are most important and relevant to the user's query. Therefore, search engines need to distinguish normal web pages from spam pages in order to avoid misleading their users [1].

To find the desired content, search engines use specific textual similarity measures to determine the relevancy of a page to a query. To measure the importance of pages, there are several global query-independent indicators, such as PageRank [2], that are often calculated from the web link structure [3]. As these two criteria are central to how search engines evaluate web pages, a new industry of Search Engine Optimization (SEO) has recently developed. Malaga [4]
* This research was conducted while Taghipour was at Amirkabir University of Technology.
grouped SEO methods into two categories: white-hat SEO, which stays within the guidelines set by search engines, and web spam (also known as black-hat SEO), which violates the rules and transgresses accepted norms. Web spamming means undeservedly boosting the rank of web pages without improving their true value.

Web spamming methods cause serious problems for search engines: they waste considerable resources on indexing illegitimate web pages, unduly decrease the quality of the retrieval process, and damage search engines [5].
According to [3], web spam techniques can be categorized into the following types:

• Content spam: Spamming methods that target textual similarity measures. The content of pages is filled with popular words so that the pages appear relevant to more popular user queries. In [6] the term "keyword stuffing" is used to refer to this method.

• Link spam: There is a general belief that pages with more incoming links are more popular and important than others. As mentioned, search engines use link-based measures such as PageRank to assess the importance of a web page. Spamming methods that intend to influence these algorithms are called link spam. Spammers create many pages that link to a target page to increase its popularity.
Extensive research has been devoted to reducing the impact of web spam. Most of the proposed solutions, such as [6, 7], treat web spam detection as a classification problem. This research considers hosts as train/test instances, with features extracted from the content of the pages within each host and the links among them. Previous work shows that in web spam detection, instances are not independent and the class labels are imbalanced [8]. In this paper we present a new approach to handle this class imbalance and to model the dependency between hosts. To our knowledge, this is the first time that Hidden Markov Models (HMMs) have been used for this purpose. The proposed system starts by building a classifier based on Aggregating One-Dependence Estimators (AODE) [9]. The Hidden Markov Model lets us take the dependency between web hosts into account during prediction and thus boosts the performance of AODE. A simple way to adapt an HMM to this task is to find the most frequent sequences of visited hosts; in the proposed system, the Ant Colony Optimization algorithm is used to generate the required sequences of hosts.
The paper is organized as follows. In Section 2, we provide an overview of previous work. In Section 3, the feature selection and classification methods are described. In Section 4, we propose a method that extracts the sequences of hosts needed to apply the HMM. Finally, we conclude by summarizing the key principles of our approach.
2   Related Work
Several automatic techniques for web spam detection have been presented in the literature. Fetterly et al. [10] demonstrated the statistical differences between machine-generated spam pages and normal web pages. They presented several features based on page content, linkage structure, and page evolution. In a follow-up paper [6] they also proposed several content-based features and a decision tree for classifying spam and normal pages. Piskorski et al. [11] studied a range of linguistic features and discovered several discriminative ones, which they have made publicly available. Moreover, in [12] Araujo et al. offered a new approach based on a combination of link-based features and language-model-based ones. They observed the semantic relation between linked pages and found it useful for improving the performance of the classification task.
In addition to traditional learning models, many papers have used graph-based methods to boost the performance of web spam detection by considering the topological dependency between hosts. Propagation along links, one of the most popular techniques in graph-based problems, has been widely used in web spam detection. Becchetti et al. [13] performed a statistical analysis of the link structure of web pages in a large collection. Their experiments show that link-based metrics such as TrustRank and Truncated PageRank can improve the performance of web spam classifiers. TrustRank starts with a seed set of good pages and then follows the link structure to propagate trust through the related pages; the implementation of this method is described in [14]. Truncated PageRank is a version of PageRank that ignores the direct contribution of near neighbors according to a damping function. Experiments by Becchetti and others in [13, 15] show that Truncated PageRank is a discriminative feature. Castillo et al. [5] proposed stacked graphical learning for propagating labels across the web graph: in addition to content- and link-based features, the average spam probability of the neighboring hosts is added to the feature vector of each host and thus taken into account in the decision process.

Link refinement, elimination, and regularization are other methodologies that exploit the link structure to improve the performance of basic classifiers. The elimination of nepotistic links is one of the proposed methods for reducing the impact of link stuffing by removing certain links from the web graph [16]. Abernethy et al. [17] presented a graph-regularization-based algorithm, WITCH, that learns simultaneously from the graph structure and from content- and link-based features.

There are also other works in this regard. For additional studies on the above topics, the reader may refer to the survey on web spam detection by Spirin and Han [8], which covers many papers and proposed systems to date.
3   Classification and Feature Selection
The proposed method was trained and tested on WebSpam-Uk2006 [18], a public web spam dataset. This collection contains 11,402 hosts from the .uk domain. For each host, 263 features have been extracted from the links and the content of its pages; additional information about the feature types and a full list are available in [6, 13]. In this dataset, 7,473 hosts were labeled by a group of volunteers into three categories: Spam, Normal, and Borderline. Here, we use the first two categories to build a model that distinguishes spam hosts from normal ones.
In this research, several classifiers such as decision trees, neural networks, and statistical classifiers were examined and compared against each other, using the F-measure as the criterion for comparing the efficiency of the different classification methods. The results showed that AODE was superior to the other competitive methods. AODE is a statistical classifier that achieves higher accuracy by averaging over an ensemble of one-dependence estimators; its core idea is to relax the independence assumption of naïve Bayes. It therefore has a lower error rate while remaining fast in both the training and test phases. In comparison with methods such as Bayesian networks, AODE has the advantage of not performing model selection, while its accuracy is comparable to non-parametric models such as decision trees and neural networks [9]. We use the AODE implementation available in Weka [19].
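To make the averaging concrete, the following is a minimal sketch of AODE over discrete (binned) features. The paper relies on Weka's implementation, so the class below, its Laplace smoothing, and the absence of a parent-frequency threshold are illustrative assumptions rather than the actual system.

from collections import Counter

class AODE:
    """Minimal Aggregating One-Dependence Estimators for discrete features.

    Illustrative sketch only: smoothing constants are assumed, not Weka's.
    """

    def fit(self, X, y):
        self.n, self.d = len(X), len(X[0])
        self.classes = sorted(set(y))
        self.joint = Counter()  # counts of (class, parent_feature, parent_value)
        self.cond = Counter()   # counts of (class, parent, p_value, child, c_value)
        for row, label in zip(X, y):
            for p in range(self.d):
                self.joint[(label, p, row[p])] += 1
                for c in range(self.d):
                    self.cond[(label, p, row[p], c, row[c])] += 1
        return self

    def predict_proba(self, x):
        scores = []
        for label in self.classes:
            total = 0.0
            for p in range(self.d):  # each feature acts once as "super-parent"
                parent = self.joint[(label, p, x[p])]
                if parent == 0:
                    continue  # skip one-dependence estimators with unseen parents
                prob = parent / self.n  # estimate of P(y, x_p)
                for c in range(self.d):
                    if c != p:  # Laplace-smoothed P(x_c | y, x_p), binary values assumed
                        prob *= (self.cond[(label, p, x[p], c, x[c])] + 1.0) / (parent + 2.0)
                total += prob
            scores.append(total)  # proportional to P(y, x), averaged over parents
        z = sum(scores)
        return [s / z for s in scores] if z > 0 else [1.0 / len(scores)] * len(scores)

Over the binary {spam, normal} label set, predict_proba yields the posterior distribution that later serves as the HMM emission probability.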
A subset of the two kinds of features, link-based and content-based, is used in this paper. Our approach is to apply the wrapper feature selection method, which evaluates candidate feature subsets with a learning model [20] and chooses the most discriminative and relevant features. After the feature selection process, only 22 of the 263 features were retained: using cross-validation and a bidirectional search, an optimal feature subset was found that yields the best accuracy for the AODE classifier. Table 1 shows the performance of the AODE classifier under different feature selection models. In this paper, the "true positive" rate is the rate of correctly detected spam hosts, and the "false positive" rate is the rate of normal hosts incorrectly detected as spam. Tables 1 and 3-5 report the results of 10-fold cross-validation.
Table 1. Comparison of feature selection methods

                                    True positive rate   False positive rate   F-Measure
Correlation Attribute Evaluation          74%                   9%               0.72
Principal Component Analysis              77.2%                 10.4%            0.724
Wrapper feature selection                 81.2%                 4.9%             0.825
The reported results show that wrapper feature selection improves the results by increasing the true positive rate and reducing the false positive rate; we therefore use AODE together with this new feature space to set up the proposed system.
4   Smoothing
In this section, we propose a method that detects web spam by using the topological dependency between hosts. We start by representing web spam detection as a graph-based problem. In this representation, each host is a node in a graph $G = (V, E)$ of web hosts $V$ and links $E$. For each pair of nodes $u$ and $v$ we have $L(u, v)$, the number of links from host $u$ to host $v$.
Most traditional classifiers presuppose that instances are independent and identically distributed, but in the web spam problem the samples are topologically dependent [21, 22]. Therefore, much latent information in the link structure between hosts is missed if we confine ourselves to the base classifier of the previous section. The proposed system is based on the smoothness assumption of semi-supervised learning: nodes that are closer to each other are more likely to share a label [23]. For each pair of nodes in the web graph we declare a closeness factor:

$$ \mathrm{closeness}(u, v) = \log\big(1 + L(u, v)\big) \tag{1} $$
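A few lines suffice to compute this factor from host-level link counts; the link-count map below is hypothetical toy data for illustration.

import math

# Hypothetical host-level link counts: links[(u, v)] = number of links
# from host u to host v, extracted from the web graph.
links = {("host-a", "host-b"): 14, ("host-b", "host-c"): 3}

def closeness(u, v):
    """Closeness factor of equation (1): log(1 + L(u, v))."""
    return math.log(1 + links.get((u, v), 0))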
On the other hand, Castillo et al. [5] showed in their experiments that normal nodes tend to be linked by very few spam nodes and mainly link to other normal nodes, while spam nodes are mainly linked to other spam nodes. Table 2 shows the results of our own study on the WebSpam-Uk2006 dataset: the probability of a transition from a spam host to a normal host is 23%, much lower than the probability of a transition to another spam host.
Table 2. Dependency of Spam and Normal classes

          Spam   Normal
Spam      77%    23%
Normal    13%    87%
Considering the conditional dependency of spam and normal hosts on each other, the topological dependency of hosts in the web graph, and the hidden category of each host, a Hidden Markov Model is intuitively a suitable learning scheme for building a pattern of the dependency between nodes and handling the imbalance between spam and normal labels.

An HMM is a probabilistic method for modeling sequences of data [24]. To put an HMM to use, we need sequences of connected hosts. This paper proposes the Ant Colony Optimization (ACO) algorithm to extract sequences of related hosts according to the similarity measure (1). Fig. 1 presents the workflow of the proposed system.
[Figure: In the train phase, the host web graph is fed to ACO, which generates the sequences for Baum-Welch; the HMM is trained on these sequences, yielding the HMM parameters. In the test phase, ACO generates sequences for the target host from the host web graph; the HMM runs the forward algorithm to calculate the probability of Spam and Normal on each sequence, and the average probability of Spam and Normal is output.]

Fig. 1. Web spam detection workflow
4.1   Ant Colony Optimization
In computer science, Ant Colony Optimization refers to a general-purpose method for finding good solutions to an optimization problem. In ACO, artificial ants build solutions and exchange information by depositing pheromone on the ground in order to mark favorable paths through the problem space [25]. To use ACO, the problem space must be represented as a graph and three fundamental components must be defined:
• Edge selection: Artificial ants move from vertex to vertex along the edges of the graph. A stochastic mechanism guides each ant $k$ in choosing an edge $(i, j)$ at each step of its walk. This mechanism uses a probability distribution based on a heuristic function $\eta_{ij}$ and the pheromone values $\tau_{ij}$. The probability function is defined as:

$$ p_{ij}^{k} = \begin{cases} \dfrac{\tau_{ij}\,\eta_{ij}}{\sum_{l \in N_i^k} \tau_{il}\,\eta_{il}} & \text{if } j \in N_i^k \\ 0 & \text{otherwise} \end{cases} \tag{2} $$

In equation (2), $\tau_{ij}$ and $\eta_{ij}$ are the pheromone value on edge $(i, j)$ and the heuristic function, respectively. $N_i^k$ denotes the set of neighboring hosts pointing to host $i$ that have not yet been visited by ant $k$.
• Heuristic function: The heuristic function is defined using the assumptions mentioned at the beginning of this section. Equation (3) gives the heuristic function $\eta_{ij}$, which is the same as the similarity measure in equation (1); an ant located at web host $i$ therefore prefers a neighbor $j$ that sends more links to $i$ than the alternatives.

$$ \eta_{ij} = \log\big(1 + L(j, i)\big) \tag{3} $$
• Pheromone update: According to [25], each artificial ant should update the pheromone on edge $(i, j)$ after each step of its walk in order to communicate with the other ants. These pheromone updates incrementally single out the best paths of connected hosts.

$$ \tau_{ij}(t+1) = (1 - \rho)\,\tau_{ij}(t) + \frac{L(j, i)}{\sum_{l} L(l, i)} \tag{4} $$
This study combines the offline and local pheromone updates of [25] into one formula; equation (4) defines the resulting update function. Here $\tau_{ij}(t)$ is the pheromone value on edge $(i, j)$ at iteration $t$, and $L(j, i)$ is the number of links from $j$ to $i$. The real number $\rho$, $0 < \rho < 1$, is a decay coefficient. Under this update, the amount of pheromone on each edge decreases over time; a higher value of $\rho$ gives other paths a greater chance of being selected by the edge selection mechanism in subsequent iterations and, as a result, yields more distinct paths of connected hosts.
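A compact sketch of this ant walk, under our reading of equations (2)-(4), might look as follows. The toy graph, the decay value, and the update schedule are illustrative assumptions, not the paper's configuration.

import math
import random
from collections import defaultdict

# in_links[i] maps each neighbor j pointing to host i to L(j, i), the
# number of links from j to i (hypothetical toy data).
in_links = {
    "t": {"a": 5, "b": 2},
    "a": {"c": 7},
    "b": {"c": 1, "a": 3},
    "c": {},
}
tau = defaultdict(lambda: 1.0)  # pheromone per edge (i, j), uniform start
RHO = 0.1                       # decay coefficient, 0 < rho < 1 (assumed)

def eta(i, j):
    # Heuristic of equation (3): log(1 + L(j, i)).
    return math.log(1 + in_links[i][j])

def walk(start, length):
    """One ant walk of the given length, built backwards along in-links."""
    path, current, visited = [start], start, {start}
    for _ in range(length):
        candidates = [j for j in in_links[current] if j not in visited]
        if not candidates:
            break
        # Edge selection of equation (2): probability proportional to tau * eta.
        weights = [tau[(current, j)] * eta(current, j) for j in candidates]
        nxt = random.choices(candidates, weights=weights)[0]
        # Pheromone update in the spirit of equation (4).
        total = sum(in_links[current].values())
        tau[(current, nxt)] = (1 - RHO) * tau[(current, nxt)] \
            + in_links[current][nxt] / total
        path.append(nxt)
        visited.add(nxt)
        current = nxt
    return path

sequences = [walk("t", 2) for _ in range(100)]  # 100 ants, as in Section 4.3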
4.2   Hidden Markov Model
An HMM is a stochastic extension of the Markov model with hidden states. In this model, the states are not visible, but the probabilities of the states and of the transitions between them are given by state-dependent functions [24, 25]. We define two states, Spam and Normal. The visible outputs are the 22-dimensional feature vectors, and the emission probability function is the AODE model presented in Section 3. Since AODE is a probabilistic model [9] that predicts posterior class probabilities, it is well suited to serve as the emission probability of the HMM.
Fig. 2. Sequences of hosts to host t
In the training phase, the Baum-Welch algorithm is run on the sequences generated by ACO to estimate the transition matrix $A$ and the initial probabilities $\pi$. All HMM parameters are re-estimated in the maximization step of Baum-Welch except the emission probabilities, which have already been estimated by the AODE classifier.
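Since the emissions are frozen, the maximization step only re-estimates $\pi$ and $A$. Below is a compact sketch of this constrained Baum-Welch under our assumptions about data layout: `emission_seqs` is a list of (T, n_states) arrays, where row t of each array holds the fixed AODE scores P(x_t | z) for the hosts of one ACO sequence.

import numpy as np

def baum_welch_fixed_emissions(emission_seqs, n_states=2, n_iter=20):
    rng = np.random.default_rng(0)
    pi = np.full(n_states, 1.0 / n_states)
    A = rng.dirichlet(np.ones(n_states) * 5, size=n_states)  # break symmetry
    for _ in range(n_iter):
        pi_acc = np.zeros(n_states)
        A_num = np.zeros((n_states, n_states))
        A_den = np.zeros(n_states)
        for B in emission_seqs:
            T = len(B)
            alpha = np.zeros((T, n_states))      # scaled forward pass
            alpha[0] = pi * B[0]
            alpha[0] /= alpha[0].sum()
            for t in range(1, T):
                alpha[t] = (alpha[t - 1] @ A) * B[t]
                alpha[t] /= alpha[t].sum()
            beta = np.ones((T, n_states))        # scaled backward pass
            for t in range(T - 2, -1, -1):
                beta[t] = A @ (B[t + 1] * beta[t + 1])
                beta[t] /= beta[t].sum()
            gamma = alpha * beta                 # per-step state posteriors
            gamma /= gamma.sum(axis=1, keepdims=True)
            pi_acc += gamma[0]
            for t in range(T - 1):               # pairwise posteriors xi_t(i, j)
                xi = alpha[t][:, None] * A * (B[t + 1] * beta[t + 1])[None, :]
                A_num += xi / xi.sum()
                A_den += gamma[t]
            # Note: the emission matrix B is NOT re-estimated (fixed AODE scores).
        pi = pi_acc / pi_acc.sum()
        A = A_num / A_den[:, None]
    return pi, A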
In the test phase, the proposed system predicts the label of a target host $t$. It first uses ACO to extract sequences of hosts that are linked to $t$; Fig. 2 illustrates example sequences of length three ending at the target web host. The forward algorithm, equation (5), is then used to calculate the probability of the normal and spam states for each sequence.
$$ P(Z_T = \mathrm{spam} \mid X_{1:T}) = \frac{\alpha_T(\mathrm{spam})}{\sum_{z} \alpha_T(z)}, \qquad \alpha_t(z) = P(X_t \mid Z_t = z) \sum_{z'} P(Z_t = z \mid Z_{t-1} = z')\,\alpha_{t-1}(z') \tag{5} $$

with the recursion initialized as $\alpha_1(z) = P(Z_1 = z)\,P(X_1 \mid Z_1 = z)$.
Here $P(X_t \mid Z_t = \mathrm{spam})$ is the probability of observing feature vector $X_t$ in the spam state, $P(Z_1)$ gives the initial probabilities of spam and normal, and $P(Z_t \mid Z_{t-1})$ is the transition probability distribution. The four possible transitions are as follows:

─ $P(Z_t = \mathrm{spam} \mid Z_{t-1} = \mathrm{spam})$
─ $P(Z_t = \mathrm{normal} \mid Z_{t-1} = \mathrm{spam})$
─ $P(Z_t = \mathrm{spam} \mid Z_{t-1} = \mathrm{normal})$
─ $P(Z_t = \mathrm{normal} \mid Z_{t-1} = \mathrm{normal})$
Finally, to predict the label of the target host, the probabilities $P(\mathrm{spam} \mid X_{1:T})$ and $P(\mathrm{normal} \mid X_{1:T})$ are averaged over all sequences. Table 3 shows the performance of the resulting classification method.
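For the two-state case, equation (5) reduces to a few lines of code. The sketch below assumes the AODE posteriors are plugged in directly as emission scores, which is a simplification of the paper's setup.

import numpy as np

STATES = ("spam", "normal")

def forward_spam_probability(emissions, pi, A):
    """Equation (5): normalized forward probability of each state at the
    end of one host sequence. `emissions` is a (T, 2) array whose row t
    holds P(x_t | z) for z in STATES, `pi` the initial distribution, and
    `A` the 2x2 transition matrix learned by Baum-Welch."""
    alpha = pi * emissions[0]
    alpha /= alpha.sum()                  # rescale to avoid underflow
    for t in range(1, len(emissions)):
        alpha = (alpha @ A) * emissions[t]
        alpha /= alpha.sum()
    return alpha                          # [P(spam | x_1:T), P(normal | x_1:T)]

def predict_host(sequence_emissions, pi, A):
    """Average the forward probabilities over all ACO sequences that end
    at the target host, as in the proposed test phase."""
    probs = np.mean([forward_spam_probability(e, pi, A)
                     for e in sequence_emissions], axis=0)
    return STATES[int(np.argmax(probs))], probs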
Table 3. Smoothing by one HMM

                      AODE    HMM
True positive rate    81.2%   81.8%
False positive rate   4.9%    4.3%
F-Measure             0.825   0.836

4.3   Multiple HMMs
So far we have used only the final output of the ACO algorithm, i.e., only the best sequences, to train a single HMM. In this section, we use the pheromone values themselves to better estimate the HMM parameters, treating the pheromone value of each edge as a measure of the conditional dependency between two hosts. Here we introduce a technique for label smoothing using multiple HMMs.

Fig. 3 shows the results of our experiments on the .uk 2006 dataset. In these experiments, we first run ACO with 100 artificial ants and extract sequences of length 2. A discretization into 10 equal-depth bins is then performed on the pheromone values, and a separate Hidden Markov Model is trained for each bin, giving ten different HMMs. The prior probability of spam $P(Z = \mathrm{spam})$ and the transition probability $P(Z_t = \mathrm{spam} \mid Z_{t-1} = \mathrm{spam})$ for each bin are presented in Fig. 3. According to the reported parameters, the label of the destination host is conditionally more dependent on the source host when there is more pheromone on the edge between them. Furthermore, the figure shows that the probability of spam is inversely related to the amount of pheromone.
The results of this experiment are convincing enough to justify using different HMM components to model the relation between hosts with different dependency values (pheromones). For sequences of length two, implementing such a system with non-parametric models is straightforward, but for sequences of length three or more we need a technique that considers the pheromone on edges at depth two or deeper in addition to the pheromone on the first edge of each sequence. Note that since we aim to use non-parametric models in the HMM, we need to discretize the edge values in order to reduce sparsity.
[Figure: for each of the ten pheromone bins, ranging from roughly 0.05 up to 3.09 and above, the plot shows the transition probability $P(Z_t = \mathrm{spam} \mid Z_{t-1} = \mathrm{spam})$ and the prior probability of spam, on a probability scale from 0 to 1.]

Fig. 3. Dependency between HMM parameters and pheromone value
In this paper, two approaches were examined for sequences of length three. In the first, a weight for each sequence of two edges was defined as the product of the pheromone values on its edges, and the binning was performed on these weights. In the second approach, the binning is applied twice: first on the edges connected to the target host, and then on the edges connected to the neighbors of the target host. The binning algorithm produces ten bins in each run, resulting in a $10 \times 10$ table of 100 HMM components. Each sequence is assigned to one of the HMM components according to the amount of pheromone on its edges; for example, if the first edge of a sequence belongs to bin 3 and the second edge belongs to bin 6, the system assigns the sequence to the HMM component in row 3 and column 6 of the pheromone table.
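The component routing of the second approach can be sketched in a few lines; the quantile-based equal-frequency binning below is our assumed reading of the "equal depth" discretization.

import numpy as np

def equal_depth_bins(values, n_bins=10):
    """Equal-frequency ('equal depth') bin boundaries from pheromone values."""
    return np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])

def component_index(seq_pheromones, edges_first, edges_second, n_bins=10):
    """Route a length-3 sequence (two edges) to one of the 10 x 10 HMM
    components: row = bin of the edge at the target host, column = bin of
    the edge one step further out."""
    row = int(np.digitize(seq_pheromones[0], edges_first))
    col = int(np.digitize(seq_pheromones[1], edges_second))
    return row * n_bins + col

At test time, the forward computation of Section 4.2 is simply run with the $\pi$ and $A$ of the selected component.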
The train and test phases are the same as before, except that $P(\mathrm{spam} \mid X_{1:T})$ and $P(\mathrm{normal} \mid X_{1:T})$ are computed with the appropriate HMM component. Table 4 shows the results of these experiments.
Table 4. Comparison of proposed approaches

                      HMM     Multiple HMMs   Multiple HMMs of order 3,   Multiple HMMs of order 3,
                              of order 2      first approach              second approach
True positive rate    81.8%   88.1%           87.6%                       91.7%
False positive rate   4.3%    6.5%            6.7%                        6.9%
F-Measure             0.836   0.843           0.838                       0.859
As Table 4 shows, the proposed system achieves an improvement by using several HMM components on sequences of length 2. Performance drops when the first approach is used to model sequences of length 3, but rises again with the second approach.

Overall, this method yields a considerable increase of about 10 percentage points in the detection rate of the baseline classifier, while the F-measure improves from 0.825 to 0.859. The next section compares the results of this study with existing methods to show to what extent the application of HMMs contributes to the improvement in web spam detection.
5   Conclusion
For many applications such as web spam detection, the i.i.d. assumption fails to exploit the dependency patterns between data points. This study proposed a system that detects web spam using content- and link-based features together with the dependency between nodes in the web host graph. To our knowledge, this is the first attempt to boost the performance of web spam detection using Hidden Markov Models. Table 5 compares the presented method with other systems in terms of the F-measure; the experimental results show that the proposed method is effective and yields better performance than other work on the same feature set. Geng et al. [26] boosted the performance of the classification task using an under-sampling method and reached an F-measure of 0.759. Castillo et al. [5], one of the most significant studies on web spam detection, report an F-measure of 0.763 using stacked graphical learning. Benczúr et al. [27] reported an F-measure of 0.738 following the same methodology as Castillo et al. [5].

To compare the performance of the proposed system with the results of the participants in the Web Spam Challenge [28], we also evaluated the proposed method on the test set provided by the organizers.
Table 5. Comparing performance of systems

                                        F-Measure
Web Spam Detection system       Test Set   Cross Validation
Our Proposed system             0.90       0.85
Castillo et al                  NA         0.76
Geng et al                      0.87       0.75
Benczúr et al                   0.91       0.73
Filoche et al                   0.88       NA
Abou et al                      0.81       NA
Fetterly et al                  0.79       NA
Cormack                         0.67       NA
One disadvantage of the proposed system is the number of HMMs needed in the second approach. For instance, the second approach requires 1000 HMM components for sequences of length 4, which makes estimating the parameters of these HMMs time-consuming. In the near future, we plan to propose an HMM with parametric transition probabilities that can handle the edge weights directly. Moreover, we intend to employ new content-based features using language modeling techniques. Based on our ongoing research on this topic, we believe that better performance can be achieved with these new features.
References
1. Henzinger, M.R., Motwani, R., Silverstein, C.: Challenges in web search engines. In:
ACM SIGIR Forum, pp. 11–22. ACM (2002)
2. Bianchini, M., Gori, M., Scarselli, F.: Inside pagerank. ACM Transactions on Internet
Technology (TOIT) 5, 92–128 (2005)
3. Gyongyi, Z., Garcia-Molina, H.: Web spam taxonomy. In: First International Workshop on
Adversarial Information Retrieval on the Web, AIRWeb 2005 (2005)
4. Malaga, R.A.: Search Engine Optimization—Black and White Hat Approaches. Advances
in Computers 78, 1–39 (2010)
5. Castillo, C., Donato, D., Gionis, A., Murdock, V., Silvestri, F.: Know your neighbors: Web
spam detection using the web topology. In: Proceedings of the 30th Annual International
ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 423–
430. ACM (2007)
6. Ntoulas, A., Najork, M., Manasse, M., Fetterly, D.: Detecting spam web pages through
content analysis. In: Proceedings of the 15th International Conference on World Wide
Web, pp. 83–92. ACM (2006)
7. Mahmoudi, M., Yari, A., Khadivi, S.: Web spam detection based on discriminative content
and link features. In: 2010 5th International Symposium on Telecommunications (IST), pp.
542–546. IEEE (2010)
8. Spirin, N., Han, J.: Survey on web spam detection: principles and algorithms. ACM
SIGKDD Explorations Newsletter 13, 50–64 (2012)
9. Webb, G.I., Boughton, J.R., Wang, Z.: Not so naive Bayes: aggregating one-dependence
estimators. Machine Learning 58, 5–24 (2005)
10. Fetterly, D., Manasse, M., Najork, M.: Spam, damn spam, and statistics: Using statistical
analysis to locate spam web pages. In: Proceedings of the 7th International Workshop on
the Web and Databases: Colocated with ACM SIGMOD/PODS 2004, pp. 1–6. ACM
(2004)
11. Piskorski, J., Sydow, M., Weiss, D.: Exploring linguistic features for Web spam detection:
A preliminary study. In: Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web, pp. 25–28. ACM (2008)
12. Araujo, L., Martinez-Romo, J.: Web spam detection: new classification features based on
qualified link analysis and language models. IEEE Transactions on Information Forensics
and Security 5, 581–590 (2010)
13. Becchetti, L., Castillo, C., Donato, D., Leonardi, S., Baeza-Yates, R.: Link-based characterization and detection of web spam. In: 2nd Intl. Workshop on Adversarial Information
Retrieval on the Web (AIRWeb), pp. 1–8 (2006)
14. Gyöngyi, Z., Garcia-Molina, H., Pedersen, J.: Combating web spam with trustrank. In:
Proceedings of the Thirtieth international conference on Very Large Data Bases, vol. 30,
pp. 576–587. VLDB Endowment (2004)
15. Becchetti, L., Castillo, C., Donato, D., Leonardi, S., Baeza-Yates, R.: Using rank propagation and probabilistic counting for link-based spam detection. In: Proc. of WebKDD
(2006)
16. Davison, B.D.: Recognizing nepotistic links on the web. Artificial Intelligence for Web
Search, 23–28 (2000)
17. Abernethy, J., Chapelle, O., Castillo, C.: Web spam identification through content and
hyperlinks. In: Proceedings of the 4th International Workshop on Adversarial Information
Retrieval on the Web, pp. 41–44. ACM (2008)
18. Yahoo! Research: Web Spam Collections, http://barcelona.research.yahoo.net/webspam/datasets/, crawled by the Laboratory of Web Algorithmics, University of Milan, http://law.dsi.unimi.it/ (retrieved August 8, 2012)
19. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA
data mining software: an update. SIGKDD Explor. Newsl. 11, 10–18 (2009)
20. Kohavi, R., John, G.H.: Wrappers for feature subset selection. Artificial intelligence 97,
273–324 (1997)
21. Menczer, F.: Mapping the semantics of Web text and links. IEEE Internet Computing 9,
27–36 (2005)
22. Chakrabarti, S., Joshi, M.M., Punera, K., Pennock, D.M.: The structure of broad topics on
the web. In: Proceedings of the 11th International Conference on World Wide Web, pp.
251–262. ACM (2002)
23. Chapelle, O., Schölkopf, B., Zien, A.: Semi-supervised learning. MIT Press (2006)
24. Rabiner, L., Juang, B.: An introduction to hidden Markov models. IEEE ASSP Magazine 3, 4–16 (1986)
25. Dorigo, M., Birattari, M., Stutzle, T.: Ant colony optimization. IEEE Computational Intelligence Magazine 1, 28–39 (2006)
26. Geng, G.-G., Wang, C.-H., Li, Q.-D., Xu, L., Jin, X.-B.: Boosting the performance of web
spam detection with ensemble under-sampling classification. In: Fourth International Conference on Fuzzy Systems and Knowledge Discovery, FSKD 2007, pp. 583–587. IEEE
(2007)
27. Benczúr, A., Bíró, I., Csalogány, K., Sarlós, T.: Web spam detection via commercial intent
analysis. In: Proceedings of the 3rd International Workshop on Adversarial Information
Retrieval on the Web, pp. 89–92. ACM (2007)
28. Web Spam Challenge (2007), http://webspam.lip6.fr/