Rule-Based On-the-fly Web Spambot Detection Using Action Strings

Pedram Hayati ([email protected])
Vidyasagar Potdar, Alex Talevski & William F. Smyth

What is Spam 2.0?

“Propagation of unsolicited, anonymous, mass content to infiltrate legitimate Web 2.0 applications”
www.antispamresearchlab.com

How does Spam 2.0 work?

Web Spambots (Spambot)
  - A web crawler or tool that navigates the WWW with the sole purpose of planting unsolicited content on external Web 2.0 applications

How is Spam 2.0 currently managed?

Flood control, nonce, Hashcash, email validation
Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA)

Problem

CAPTCHA
  - Decreases human users' convenience
  - Computers are becoming powerful enough to decipher it
Content-based solutions (opinion spam, social spam, video spam, etc.)
  - Focussed on one particular form of spam
  - Do not produce satisfactory results

Solution Idea

Main assumption: human web usage behaviour is intrinsically different from spambot behaviour.
Web usage data
  - User click-stream
  - Widely used
  - Two additional attributes
      - Session ID
      - Username

Solution: Action

Action
  - Models web usage data as a behavioural model
  - A set of user efforts to achieve a certain purpose
  - A suitable discriminative feature for modelling user behaviour
  - Extendible to many other Web 2.0 platforms

Example
  - "Register a new user account" action
      1. User navigates to the registration page
      2. User fills in the registration form fields
      3. User clicks the submit button
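
The registration example can be read as an ordered pattern over the requests in one session. The sketch below is only an illustration under stated assumptions: the request strings, the `REGISTER_STEPS` tuple and the `contains_action` helper are hypothetical names, and it assumes that only the page visit and the form submission are visible in server-side click-stream data (filling in the form is not).

```python
# Hypothetical request pattern for the "register a new user account" action.
# Step 1 (navigate) is assumed to appear as a GET, step 3 (submit) as a POST.
REGISTER_STEPS = ("GET /register", "POST /register")


def contains_action(requests, steps):
    """Return True if `steps` occur in `requests` in order.

    The steps do not need to be adjacent, so unrelated requests
    (images, stylesheets) between them are tolerated.
    """
    remaining = iter(requests)
    # `step in remaining` advances the iterator until the step is found,
    # so each later step is searched for only after the earlier one.
    return all(step in remaining for step in steps)


session = ["GET /", "GET /register", "GET /style.css", "POST /register"]
print(contains_action(session, REGISTER_STEPS))
```

A real deployment would define one such pattern per action for its own platform.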

Solution: Action String

Action String
  - A sequence of actions represented as a string of letters
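
As a minimal sketch of this encoding, each action can be assigned one letter and a session's ordered actions joined into a string. The action names and the letter assignment in `ACTION_LETTERS` are assumptions made for illustration; the slides do not give the actual mapping.

```python
# Hypothetical action-to-letter alphabet (not taken from the original system).
ACTION_LETTERS = {
    "register": "A",
    "login": "B",
    "start_topic": "C",
    "reply": "D",
}


def to_action_string(actions):
    """Encode an ordered list of action names as one letter per action."""
    return "".join(ACTION_LETTERS[name] for name in actions)


print(to_action_string(["register", "login", "start_topic", "start_topic"]))
```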

Solution: Trie

A way to store and retrieve information
  - Ease of updating and handling
  - Shorter access time
  - Removes redundancies
  - Takes the form of a tree structure
We construct action strings using a Trie data structure
  - Fast on-the-fly pattern matching
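
The sketch below shows one way a Trie of action strings might look. It is not the authors' implementation: the class names and the idea of keeping human and spambot counters on the node where each training string ends are assumptions chosen to make lookups easy to illustrate.

```python
class TrieNode:
    """One node per letter; shared prefixes are stored only once."""

    def __init__(self):
        self.children = {}  # letter -> TrieNode
        self.human = 0      # training strings ending here, from humans
        self.spambot = 0    # training strings ending here, from spambots


class ActionTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, action_string, is_spambot):
        """Add one labelled action string, creating nodes as needed."""
        node = self.root
        for letter in action_string:
            node = node.children.setdefault(letter, TrieNode())
        if is_spambot:
            node.spambot += 1
        else:
            node.human += 1

    def lookup(self, action_string):
        """Return (human, spambot) counts, or None if the path is unseen."""
        node = self.root
        for letter in action_string:
            node = node.children.get(letter)
            if node is None:
                return None
        return node.human, node.spambot


trie = ActionTrie()
trie.insert("ABC", is_spambot=True)
trie.insert("ABC", is_spambot=True)
trie.insert("AB", is_spambot=False)
print(trie.lookup("ABC"), trie.lookup("AB"), trie.lookup("AX"))
```

A lookup costs one dictionary access per letter, which is what makes matching cheap enough to run while a session is still in progress.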

Solution: Framework

[Figure: solution framework]

Performance Measurement

Matthews Correlation Coefficient (MCC)
One of the best performance measures for binary classification
Considers true and false positives and negatives, and returns a value between -1 and +1.
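
For reference, MCC can be computed from the four confusion-matrix counts using the standard formula. The function name `mcc` and the convention of returning 0 when the denominator is zero are choices made for this sketch.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts.

    tp/tn: true positives/negatives, fp/fn: false positives/negatives.
    Returns a value in [-1, +1]; 0.0 is returned when any marginal
    total is zero and the coefficient is otherwise undefined.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


print(mcc(tp=5, tn=5, fp=0, fn=0))  # perfect prediction
print(mcc(tp=0, tn=0, fp=5, fn=5))  # total disagreement
```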

Experiment

Data set
  - No publicly available collection
  - Spambot data from our HoneySpam 2.0 project
  - Human data from an active forum
  - 16,594 entries
      - 11,039 spambot records
      - 5,555 human records
Test
  - Five random datasets (DS1 to DS5)
  - 2/3 for building the Trie structure
  - 1/3 for testing
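
The slides do not say how the random datasets were drawn, so the following is only a sketch of one plausible 2/3 to 1/3 split. The `split_two_thirds` helper and the use of a seed per dataset are assumptions.

```python
import random


def split_two_thirds(records, seed):
    """Shuffle `records` and return (build, test) parts of about 2/3 and 1/3."""
    rng = random.Random(seed)  # a separate seed gives a separate random dataset
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]


# Stand-in record IDs; real records would be labelled action strings.
build, test = split_two_thirds(range(16594), seed=1)
print(len(build), len(test))
```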

Experiment: On-the-Fly Detection

Simulates real-world practice, where user action strings grow over time
The system creates action strings as the actions happen.
Make a window over the test action strings
  - Run our classifier
  - Increase the window's size
  - Example action string: ABCDEFG
Aim: identify a spambot in the fewest possible actions
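
The growing-window procedure can be sketched as follows, assuming the window always starts at the first action and is extended one action at a time. The helper names and the `is_spambot_prefix` callback are hypothetical stand-ins for the classifier; the default sizes mirror the 2 to 10 character range used in the results.

```python
def growing_windows(action_string, min_size=2, max_size=10):
    """Yield prefixes of `action_string` of increasing length."""
    last = min(max_size, len(action_string))
    for size in range(min_size, last + 1):
        yield action_string[:size]


def first_detection(action_string, is_spambot_prefix, min_size=2, max_size=10):
    """Return the smallest window size flagged as spambot, or None."""
    for prefix in growing_windows(action_string, min_size, max_size):
        if is_spambot_prefix(prefix):
            return len(prefix)
    return None


# Toy classifier for illustration only: flags any window ending in "D".
print(list(growing_windows("ABCDEFG", 2, 4)))
print(first_detection("ABCDEFG", lambda prefix: prefix.endswith("D")))
```

Returning the first window size that triggers a detection matches the stated aim of identifying a spambot in as few actions as possible.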

Experiment: Results

Window size ranges from 2 to 10 characters
Threshold ranges from -0.05 to 0.05

Experiment: Discussion

The system predicts better as a user uses the system over time.
Performance remains the same beyond a certain window size
  - Datasets are randomly selected
  - The same happens for the accuracy of the results

Conclusion

A fairly young area of research.
Current work is focussed on one particular type of spam.
Our aim: detect web spambots as a source of spam problems on the Web 2.0 platform.
  - Based on web usage behaviour
  - Formulated into Actions => Action Strings
  - On-the-fly detection using a Trie
Result: average accuracy of 93%
THANK YOU!
Pedram Hayati ([email protected])
Vidyasagar Potdar ([email protected])
Alex Talevski ([email protected])
William F. Smyth ([email protected])
http://www.antispamresearchlab.com
http://debii.curtin.edu.au
http://www.curtin.edu.au

Appendix: Related Work

Tan et al.: web robot navigational patterns, such as session length and the set of visited webpages, are different from those of humans.
Park et al.: malicious web robot detection based on the types of requests for web objects and the existence of mouse/keyboard activity.
Göbel et al.: interaction with spam botnet controllers.
Yu et al. and Yiqun et al.: distinguish spam webpages from legitimate webpages by employing user web access logs.

Appendix: Future Work

Compare different performance measurement techniques.
Develop an adaptive solution.
Experiment on different platforms (e.g. datasets).