RULE-BASED ON-THE-FLY WEB SPAMBOT DETECTION USING ACTION STRINGS
Pedram Hayati ([email protected]), Vidyasagar Potdar, Alex Talevski & William F. Smyth
www.antispamresearchlab.com

What is Spam 2.0
- "Propagation of unsolicited, anonymous, mass content to infiltrate legitimate web 2.0 applications"

How does Spam 2.0 work
- Web Spambots (Spambot)
  - A web crawler/tool that navigates the WWW with the sole purpose of planting unsolicited content on external web 2.0 applications

How is Spam 2.0 currently managed
- Flood control, Nonce, Hash-Cash, Email validation
- Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA)

Problem
- CAPTCHA
  - Decreases human users' convenience
  - Computers are becoming powerful enough to decipher it
- Content-based solutions (Opinion Spam, Social Spam, Video Spam, etc.)
  - Focussed on one particular form of spam
  - Do not produce satisfactory results

Solution Idea
- Main assumption: human web usage behaviour is intrinsically different from spambot behaviour
- Web usage data
  - User click-stream
  - Widely used
- Two additional attributes
  - Session ID
  - Username

Solution: Action
- Action
  - Models web usage data as a behavioural model
  - A set of user efforts to achieve a certain purpose
  - A suitable discriminative feature for modelling user behaviour
  - Extendible to many other Web 2.0 platforms
- Example: the "Register a new user account" action
  1. User navigates to the registration page
  2. User fills in the registration form fields
  3. User clicks the submit button

Solution: Action String
- Action String
  - A sequence of actions in alphabetical format

Solution: Trie
- A way to store and retrieve information in the form of a tree structure
  - Ease of updating and handling
  - Shorter access time
  - Removes redundancies
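
The idea of storing action strings in a trie can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each action is encoded as a single letter, and the names `TrieNode`, `ActionTrie`, `spambot_count` and `human_count` are hypothetical. Each node simply counts how many spambot and human training strings pass through it, so a prefix lookup costs one step per action.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TrieNode:
    # Child nodes keyed by the next action letter (assumed one letter per action).
    children: Dict[str, "TrieNode"] = field(default_factory=dict)
    # Hypothetical per-node statistics: training strings passing through this node.
    spambot_count: int = 0
    human_count: int = 0


class ActionTrie:
    """Stores labelled action strings so shared prefixes are kept only once."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, action_string: str, is_spambot: bool) -> None:
        # Walk (and create) one node per action, updating the label counts.
        node = self.root
        for action in action_string:
            node = node.children.setdefault(action, TrieNode())
            if is_spambot:
                node.spambot_count += 1
            else:
                node.human_count += 1

    def lookup(self, prefix: str) -> Optional[TrieNode]:
        # Follow the prefix letter by letter; None means it was never seen.
        node = self.root
        for action in prefix:
            node = node.children.get(action)
            if node is None:
                return None
        return node


# Made-up training strings, for illustration only.
trie = ActionTrie()
trie.insert("ABC", is_spambot=True)
trie.insert("ABD", is_spambot=True)
trie.insert("AEF", is_spambot=False)
```

With these three illustrative strings, the node reached by `trie.lookup("AB")` has been visited by two spambot strings and no human strings, while an unseen prefix such as `"ZZ"` returns `None`.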
- We construct action strings using a Trie data structure
  - Fast on-the-fly pattern matching

Solution: Framework
[Figure: framework diagram, shown over two slides]

Performance Measurement
- Matthews Correlation Coefficient (MCC)
  - One of the best performance measures for binary classification
  - Considers true and false positives and negatives, and returns a value between -1 and +1

Experiment
- Data Set
  - No publicly available collection
  - Spambot data from our HoneySpam 2.0 project
  - Human data from an active forum
  - 16594 entries
    - 11039 spambot records
    - 5555 human records
- Test
  - Five random datasets (DS1 to DS5)
  - 2/3 for building the Trie structure
  - 1/3 for testing

Experiment: On-The-Fly Detection
- Simulates real-world practice, where user action strings grow over time
  - The system creates action strings as the actions happen
- Procedure
  - Make a window over the test action strings
  - Run our classifier
  - Increase the window's size (e.g. a window growing over the action string ABCDEFG)
- Aim: identify a spambot in the smallest number of actions

Experiment: Results
- Window size ranges from 2 to 10 characters
- Threshold ranges from -0.05 to 0.05
[Figure: results]

Experiment: Discussion
- The system predicts better as the user uses the system over time
- Performance stays the same beyond a certain window size
  - Datasets are randomly selected
- The same holds for the accuracy of results

Conclusion
- Quite a young area of research
- Current work
  - Focussed on one particular type of spam
- Our aim: detect web spambots as a source of spam problems on the Web 2.0 platform
  - Based on web usage behaviour
  - Formulated into Actions => Action String
  - On-the-fly detection: using Trie
- Result: average accuracy of 93%

THANK YOU!
Pedram Hayati ([email protected])
Vidyasagar Potdar ([email protected])
Alex Talevski ([email protected])
William F. Smyth ([email protected])
http://www.antispamresearchlab.com
http://debii.curtin.edu.au
http://www.curtin.edu.au

Appendix: Related Works
- Tan et al.: web robot navigational patterns, such as session length and the set of visited webpages, are different from those of humans
- Park et al.: malicious web robot detection based on the types of requests for web objects and the existence of mouse/keyboard activity
- Göbel et al.: interaction with spam botnet controllers
- Yu et al. and Yiqun et al.: distinguish spam webpages from legitimate webpages by employing user web access logs

Appendix: Future Works
- Compare different performance measurement techniques
- Develop an adaptive solution
- Experiment on different platforms (e.g. datasets)
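
Appendix: Illustrative Sketch
The Performance Measurement and On-The-Fly Detection slides can be made concrete with a short sketch. The MCC function follows the standard definition, MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The `growing_windows` helper is an assumption: it treats each window as a prefix of the action string that grows from 2 to 10 characters, which is one plausible reading of the slides rather than a confirmed detail. Both function names are hypothetical.

```python
import math
from typing import Iterator


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: return 0 when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


def growing_windows(action_string: str, min_size: int = 2,
                    max_size: int = 10) -> Iterator[str]:
    """Yield prefixes of increasing length (assumed meaning of a 'window')."""
    upper = min(max_size, len(action_string))
    for size in range(min_size, upper + 1):
        yield action_string[:size]


# Each yielded window would be passed to the classifier in turn, stopping
# as soon as a confident spambot/human decision can be made.
windows = list(growing_windows("ABCDEFG"))
```

A perfect classifier gives `mcc(5, 5, 0, 0) == 1.0` and a completely inverted one gives `mcc(0, 0, 5, 5) == -1.0`, matching the stated range of -1 to +1.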