Spam Campaign Detection, Analysis, and Formalization

Thesis
Mina Sheikhalishahi
Doctorate in Computer Science
Philosophiæ Doctor (Ph.D.)
Québec, Canada
© Mina Sheikhalishahi, 2016

Under the supervision of:
Research supervisor: Mohamed Mejri
Research co-supervisor: Nadia Tawbi
Résumé
Spam emails (unsolicited or junk emails) impose extremely heavy yearly costs in terms of time, storage space, and money on both private users and companies. To fight the spam problem effectively, it is not enough to stop spam messages from being delivered to the user's inbox. It is necessary either to try to find and prosecute the spammers, who generally hide behind complex networks of infected devices, or to analyze spammer behavior in order to devise appropriate defense strategies. Such a task is difficult, however, because of camouflage techniques, and it requires a manual analysis of correlated spam emails to find the spammers.
To facilitate such an analysis, which must be performed on large amounts of unclassified emails, we propose a categorical clustering methodology, named CCTree, that divides a large volume of spam emails into campaigns based on their structural similarity. We show the effectiveness and efficiency of our proposed clustering algorithm through several experiments. A self-learning approach is then proposed to label spam campaigns according to the spammer's goal, e.g. phishing. The labeled spam campaigns are used to train a classifier, which can be applied to classify new spam emails. In addition, the labeled campaigns, together with a set of four other ranking features, are ordered according to investigators' priorities.
Finally, a semiring-based structure is proposed for the abstract representation of CCTree. The abstract schema of CCTree, named CCTree term, is applied to formalize CCTree parallelism. Through a number of mathematical analyses and experimental results, we show the efficiency and effectiveness of the proposed framework.
Abstract
Spam emails impose extremely heavy yearly costs in terms of time, storage space, and money on both private users and companies. To effectively fight the problem of spam emails, it is not enough to stop spam messages from being delivered to the end user's inbox or collected in a spam box. It is necessary either to try to find and prosecute the spammers, who generally hide behind complex networks of infected devices that send spam emails against their users' will, i.e. botnets, or to analyze spammer behavior in order to devise appropriate counter-strategies. However, such a task is difficult due to camouflage techniques, which make a manual analysis of correlated spam emails necessary to find the spammers.
To facilitate such an analysis, which should be performed on large amounts of unclassified raw emails, we propose a categorical clustering methodology, named CCTree, to divide large amounts of spam emails into spam campaigns by structural similarity. We show the effectiveness and efficiency of our proposed clustering algorithm through several experiments. Afterwards, a self-learning approach is proposed to label spam campaigns based on the goal of the spammer, e.g. phishing. The labeled spam campaigns are used to train a classifier, which can be applied to classify new spam emails. Furthermore, the labeled campaigns, together with a set of four additional ranking features, are ordered according to investigators' priorities.
A semiring-based structure is proposed to abstract the CCTree representation. Through several theorems we show that, under some conditions, the proposed approach fully abstracts the tree representation. The abstract schema of CCTree, named CCTree term, is applied to formalize CCTree parallelism.
Through a number of mathematical analyses and experimental results, we show the efficiency and effectiveness of our proposed framework as an automatic tool for spam campaign detection, labeling, ranking, and formalization.
Contents

Résumé
Abstract
Contents
List of Tables
List of Figures
Acknowledgments

1 Introduction
1.1 Motivation
1.2 Main Contributions
1.3 Thesis Outline

2 State of the Art
2.1 Spam Emails Issues
2.2 Clustering Spam Emails into Campaigns
2.3 Labeling and Ranking Spam Campaigns
2.4 On the Formalization of Clustering and its Applications

3 Spam Campaign Detection
3.1 Introduction
3.2 Preliminary Notions
3.3 Related Works
3.4 Categorical Clustering Tree (CCTree)
3.5 Time Complexity
3.6 Conclusion

4 Effectiveness and Efficiency of CCTree in Spam Campaign Detection
4.1 Introduction
4.2 Framework
4.3 Evaluation and Results
4.4 Discussion and Comparisons
4.5 Related Work
4.6 Conclusion

5 Labeling and Ranking Spam Campaigns
5.1 Introduction
5.2 Related Work
5.3 Digital Waste Sorting
5.4 Results
5.5 Ranking Spam Campaigns
5.6 Conclusion

6 Algebraic Formalization of CCTree
6.1 Introduction
6.2 Related Work
6.3 Feature-Cluster Algebra
6.4 Feature-Cluster (Family) Term Abstraction
6.5 Relations on Feature-Cluster Algebra
6.6 CCTrees Parallelism
6.7 Conclusion

7 Conclusions and Future Work
7.1 Thesis Summary
7.2 Future Work

A Appendix
A.1 Source Codes of Proposed Approach
A.2 Tables of Attributes

Bibliography
List of Tables

4.1 Features extracted from each email.
4.2 CCTree internal evaluation with fixed number of elements.
4.3 Internal evaluation results of CCTree, COBWEB and CLOPE.
4.4 Silhouette values and number of clusters as a function of µ for four email datasets.
4.5 Silhouette result, Hamming distance, ε = 0.001, and µ changes.
4.6 Number of clusters, ε = 0.001, and µ changes.
4.7 External evaluation results of CCTree, COBWEB and CLOPE.
4.8 Campaigns on the February 2015 dataset from five clustering methodologies.
5.1 Features extracted from each email.
5.2 Feature vectors of a spam email for each class.
5.3 Classification results evaluated with K-fold validation on training set.
5.4 Classification results evaluated on test set.
5.5 Training set generated from small knowledge.
5.6 DWS classification results for the labeled spam campaigns.
5.7 Set of ranking features.
5.8 Normalized score of spam campaigns label.
5.9 Three first ranked campaigns.
6.1 CCTree Rewriting System.
6.2 Composition Rewriting System.
7.1 Table of Notations.
A.1 Language of spam message and subject.
A.2 Type of attachment.
A.3 Attachment size.
A.4 Number of attachments.
A.5 Average size of attachments.
A.6 Type of message.
A.7 Length of message.
A.8 IP-based links verification.
A.9 Mismatch links.
A.10 Number of links.
A.11 Number of domains.
A.12 Average number of dots in links.
A.13 Hex characters in links.
A.14 Words in subject.
A.15 Characters in subject.
A.16 Non-ASCII characters in subject.
A.17 Recipients of spam email.
A.18 Images in spam messages.
List of Figures

1.1 Steady volume of spam.
1.2 McAfee Report 2015.
1.3 The framework of thesis.
3.1 Dataset 1.
3.2 Dataset 2.
3.3 Spam 1.
3.4 Spam 2.
3.5 A small CCTree.
4.1 CCTree(0.001,1).
4.2 CCTree(0.01,1).
4.3 CCTree(0.1,1).
4.4 CCTree(0.5,1).
4.5 Internal evaluation at the variation of the ε parameter.
4.6 COBWEB.
4.7 CCTree(0.001,1).
4.8 CCTree(0.001,10).
4.9 CCTree(0.001,100).
4.10 CCTree(0.001,1000).
4.11 CLOPE.
4.12 Silhouette as a function of the number of clusters for different values of µ.
4.13 Silhouette (Hamming).
4.14 Generated clusters.
4.15 Silhouette (Hamming).
4.16 Generated clusters.
5.1 Advertisement.
5.2 Portal.
5.3 Fraud.
5.4 Malware.
5.5 Crypto ransomware volume.
5.6 Phishing.
5.7 DWS workflow.
5.8 Insert new instance X in a CCTree.
5.9 ROC curve / Advertisement.
5.10 ROC curve / Portal Redirection.
5.11 ROC curve / Fraud.
5.12 ROC curve / Malware.
5.13 ROC curve / Phishing.
6.1 A small CCTree.
6.2 Parallel clustering workflow.
To my love, my family
and
To anyone who looks for
worldwide peace and happiness
Acknowledgments
Though only my name appears on the cover of this dissertation, a great many people have
contributed to its production. I owe my gratitude to all those people who have made this
dissertation possible.
First and foremost, I want to thank my supervisor, professor Mohamed Mejri, for accepting me into his research group, which improved my view of life. I appreciate all his contributions of time, ideas, patience, and funding that made my Ph.D. experience productive and stimulating. Thanks for allowing me to grow as a research scientist, and for all his patience and support.
I would also like to express my deep thanks to my co-advisor, professor Nadia Tawbi, who has always been there to listen and give advice. Thanks to her for all her kind moral and financial support and for helpful discussions at different stages of my Ph.D. studies. I gratefully acknowledge her support for my cooperation with the IIT-CNR research group, which changed my life.
I really appreciate the insightful comments and constructive criticisms of my advisor and co-advisor at different stages of my research, and their encouragement of correct grammar and consistent notation in my writing.
Besides my advisors, I would like to thank the rest of my thesis committee: Prof. Fabio Martinelli, Prof. Raphael Khoury, and Dr. Ilaria Matteucci, for their insightful comments and encouragement. Special thanks to professor Fabio Martinelli for welcoming me into his research group at IIT-CNR, Italy, which enriched my research experience.
My time in Quebec was made enjoyable in large part by the many friends who became part of my life. I am grateful to my dearest Shadi, who supported me continuously during my three years in Quebec. With her presence in Quebec, I always felt I had a family member taking care of me. To my kind friend Bahareh, whom I bothered several times from Italy to take care of things in Quebec on my behalf. Thanks to my other kind friends in Quebec: Elaheh, Afrooz, Sheyda, Soamyeh.
I am especially grateful to my best friend Sara, who was always available from Iran, in the very difficult moments of my Ph.D., to send me messages, to support, encourage, and motivate me. She was always there to hear me, despite the different time zones of Iran and Canada. I will always appreciate all her kind and continuous support.
Many thanks to my other friends from Iran: Mahboobeh, for continuously remembering me and praying for me, and Mahmoud, for always following my weblog and motivating me.
I would like to deeply thank my family for all their love and encouragement. To my father, who always motivated us to read, to know, to follow our dreams, and who always loves us as we are. To my mother, who finally accepted my travel to Canada although she was never convinced, for all the worries she went through during my Ph.D., and for all her patience while I was following my dreams, even against her own.
Thanks to my dearest sister, Mojgan, who was my link to Iran. She always followed up on whatever I needed to be done in Iran, and always motivated me with her typically sweet words. Many thanks to my brother, Mohammad, who always supported me in all my pursuits and of whom we are all proud. I am also grateful to Hamed, my brother-in-law, who called me many times from Iran to tell me that they all love and miss me. To my kindest aunt, Azra, who always teaches us that you can still smile even when life is passing through its most difficult and challenging stages.
Most of all, I would like to give my deep gratitude to my colleague, my friend, my love, Andrea, who cleared many of the obstacles that I faced along my Ph.D. path, and who generously, from the first moments of my arrival in Italy, shared with me his experience of research. Many thanks for all his faithful support, patience, and encouragement during the difficult stages of my Ph.D. thesis. Thanks for his presence in my life, for all the happiness he brought with him, and for giving me the feeling that I can make all my dreams come true.
Mina Sheikhalishahi
Laval University
Quebec, Canada
Chapter 1
Introduction
The term spam became well known through a sketch of the comedy program “Monty Python's Flying Circus”, in which a waitress proposes dishes containing an ingredient called spam, a brand of canned meat produced by the American Hormel Foods Corporation. In the sketch, all the foods in the restaurant are served with lots of spam, and the waitress repeats the word spam several times while describing how much spam is in the plates. A group of “Vikings” in the corner then starts a song: “Spam, spam, spam, spam, spam, spam, spam, spam, lovely spam! Wonderful spam!” Hence, the term came to refer to something that keeps repeating and repeating to great annoyance 1. Owing to the success of this program, and probably because the canned meat constituted the only nutritious food available in England during the Second World War, the term “SPAM” came to indicate something inevitably omnipresent.
The name was later imported to unwanted electronic messages: the first spam email is believed to have been sent on 1 May 1978 by Digital Equipment Corporation to advertise a new product; it was sent to all ARPAnet users on the West Coast of the United States, a few hundred people 2.
Many years later, in January 1994 3, the first large-scale unwanted commercial message was distributed across USENET, titled “Global Alert for All: Jesus is Coming Soon”. It was posted to every newsgroup, and it established the notion of unwanted messages sent massively to unwilling recipients.
More precise definitions of spam email were introduced later in the literature. [8] define spam email, also known as junk email or unsolicited bulk email, as an electronic message sent in bulk against the will of the receiver. [83] define spam email as an unwanted email sent indiscriminately by a sender who has no current relationship with the receiver.
1. http://www.internetsociety.org/
2. www.templetons.com/brad/spamreact.html
3. www.wired.com

Nowadays, spam emails are not just undesired advertisement. The problem of unsolicited emails causes incredibly huge costs to companies and private users [113], [83], [84]. Currently proposed approaches [30], [46], [123], though quite effective in stopping spam emails from being delivered to end users' inboxes [21], [89], do not propose a methodology to organize huge amounts of messages in order to fight against the root of the problem, i.e. the spammer.
Any effort in this regard requires a first analysis of large amounts of spam emails, mostly collected in honeypots. This first analysis demands grouping huge amounts of data into smaller groups, named spam campaigns, which are supposed to originate from the same source (spammer). Then, a classifier must be trained to label and group new spam emails. Furthermore, the large set of detected spam campaigns should be ordered automatically, based on the investigators' priorities.
Figure 1.1 – Steady volume of spam.
To this end, in the present thesis, we first propose a fast and effective categorical clustering algorithm, named CCTree, to detect spam campaigns on the basis of the structural similarity of messages. Afterwards, we propose a self-learning methodology to automatically label detected spam campaigns based on the goal of the spammer. The labeled campaigns are then ranked automatically considering a set of ranking priorities. A semiring-based formal method is proposed to abstract the CCTree representation. The abstract form is used to formalize the process of clustering spam emails on parallel computers, which may help to speed up the process of spam campaign detection.
1.1 Motivation
Being incredibly cheap to send, spam messages are vastly used by adversaries to steal money, distribute malware, advertise goods and/or services, etc.
The Cisco 2015 report [36] shows that although adversaries develop more sophisticated techniques to breach network defenses, spam emails still play a major role in these attacks, and the worldwide volume of spam has remained relatively consistent (Figure 1.1). Furthermore, it has been shown [36] that 4.5 billion emails get blocked every day. The Internet Threats Trend Report [114] estimates that 54 billion spam emails were sent per day in 2014. According to the McAfee 2015 report [100], unsolicited emails constituted more than 70 percent of the total amount of email messages in 2014 (Figure 1.2).
Figure 1.2 – McAfee Report 2015.
Microsoft and Google [113] estimate that spam emails cost American firms and consumers up to 20 billion dollars per year. Ferris Research estimated the worldwide cost of spam at $50 billion in 2005, and raised its estimate to $100 billion in 2007 and $130 billion in 2009 4, [112]. [83] report that 382 million mailing attempts resulted in 28 sales. Yahoo! data on similar “high ticket” items, sold with a marginal profit of more than $50, shows conversion rates of about 1 in 25,000 [112].
4. www.email-museum.com/
The problem of undesired electronic messages has become a serious issue, due to the many troubles spam causes to the Internet community. [5] categorize spam losses into three different groups, named direct losses, indirect losses, and defense costs, and call the sum of these losses the society losses of spam. In what follows, the society losses proposed in [5] are listed:
Direct losses by spam:
• “Money withdrawn from victim accounts
• Time and effort to reset account credentials (for banks and consumers)
• Distress suffered by victims
• Secondary costs of overdrawn accounts: deferred purchases, inconvenience of not having access to money when needed
• Lost attention and bandwidth caused by spam messages, even if they are not reacted to.”
Indirect losses by spam:
• “Loss of trust in online banking, leading to reduced revenues from electronic transaction fees, and higher costs for maintaining branch staff and cheque clearing facilities
• Missed business opportunity for banks to communicate with their customers by email
• Reduced uptake by citizens of electronic services as a result of lessened trust in online transactions
• Efforts to clean-up PCs infected with malware for a spam sending botnet”
Defense costs of spam:
• “Security products such as spam filters, antivirus, and browser extensions to protect users
• Security services provided to individuals, such as training and awareness measures
• Security services provided to industry, such as website take-down services
• Fraud detection, tracking, and recuperation efforts
• Law enforcement
• The inconvenience of missing messages falsely classified as spam”
The large amount of spam traffic among servers delays the delivery of legitimate emails; sorting out unsolicited messages takes time; and in the process of classifying messages into spam and legitimate, there is the risk of deleting an important email by mistake. The problems resulting from spam emails thus create an unbearable situation for everyone who uses the Internet.
To get a better insight into the direct and indirect losses of spam, here we briefly present some reports.
Microsoft and Google [113] estimate that spam emails cost American firms and consumers up to 20 billion dollars per year, whilst [83], [84] show that a successful spam campaign can earn revenues between $400k and $1000k. [133] estimated that the Cutwail botnet earned around $1.7 million to $4.2 million in one year by providing spam services. It has been calculated that a company with 1000 employees loses about $500,000 per year in productivity costs resulting from spam messages 5.
The most popular solution to the problem of spam is filtering [21]. Spam filtering can be defined as a methodology to divide messages into spam and legitimate [21]. Currently, the most used approach for fighting spam emails consists in identifying and blocking them on the recipient machine through filters [30], [46], [123], which are generally based on machine learning techniques or content features [22], [138], [139]. Although existing filtering algorithms often show an accuracy of more than 90% in experimental evaluations [21], [89], this does not stop spammers from imposing considerable costs on users and companies [113]. We believe the reason could be that the spammer, the root of the problem, faces minimal risk of being caught or followed.
To effectively fight the problem of spam emails, it is mandatory to find and prosecute the spammers, who generally hide behind complex networks of infected devices that send spam emails against their users' will, i.e. botnets. Due to botnets, identifying the spammer is a difficult task, yet a possible one [142], [149], [45].
To simplify this analysis, first of all, huge amounts of spam emails must be divided into spam campaigns. A spam campaign is the set of messages spread by a spammer with a specific purpose [27], such as advertising a specific product, spreading ideas, or criminal intents, e.g. phishing. Grouping spam messages into spam campaigns reveals behaviors that may be difficult to infer when we look at a large collection of spam emails as a whole [132]. It is noteworthy that the problem of grouping a large amount of spam emails into smaller groups is an unsupervised learning task, since there is no labeled data for training a classifier in the beginning. The proposed approach for clustering spam messages should be based on the premise that the general appearance of messages belonging to the same spam campaign remains mainly unchanged, although spammers usually insert random text or links [27]. The rationale behind this approach is that two messages in the same format, i.e. with a similar language, size, number of attachments, amount of links, etc., are more likely to originate from the same source (spammer), and thus to belong to the same campaign. Hence, the discriminative structural features of messages must be selected correctly. Furthermore, the clustering algorithm should be fast and effective in grouping junk emails into spam campaigns.
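To make the notion of structural similarity concrete, the following is a minimal sketch of structural feature extraction from a raw email; the chosen features and the bucketing of counts into categories are illustrative stand-ins for the 21 categorical features proposed in Chapter 3, not the actual feature set.

```python
# Minimal sketch of structural feature extraction; the feature set and the
# buckets below are illustrative only (the actual 21 categorical features
# are defined in Chapter 3 and the Appendix).
import re
from email import message_from_string

def structural_features(raw_email: str) -> dict:
    msg = message_from_string(raw_email)
    payload = msg.get_payload()
    body = payload if isinstance(payload, str) else ""
    links = re.findall(r"https?://\S+", body)
    attachments = [part for part in msg.walk()
                   if part.get_content_disposition() == "attachment"]

    def bucket(n: int, edges=(1, 5, 10)) -> str:
        # Turn a raw count into a categorical value.
        for e in reversed(edges):
            if n >= e:
                return f">={e}"
        return "0"

    return {"num_links": bucket(len(links)),
            "num_attachments": bucket(len(attachments)),
            "body_length": bucket(len(body), edges=(500, 2000))}

print(structural_features("Subject: hi\n\nVisit http://example.org/win now"))
# -> {'num_links': '>=1', 'num_attachments': '0', 'body_length': '0'}
```

Two emails from the same campaign, even with randomized text, tend to agree on such categorical values, which is what a structural clustering algorithm can exploit.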
Afterwards, each campaign should be assigned a label describing the purpose of the spammer. This goal-based labeling facilitates for investigators the analysis of spam campaigns, eventually directed toward a specific cybercrime. Moreover, labeling spam campaigns based on the goal of the spammer can help to rank them.
5. http://www.fixedbyvonnie.com/2013/08/what-is-spam-and-how-you-get-junk-email/
Ranking spam campaigns based on the investigator's priorities provides an ordered set of spam campaigns, on the basis of which the investigator decides which spam campaigns must be analyzed first, a decision that is difficult when looking at a large number of detected spam campaigns as a whole.
It is not uncommon for a data mining process to require several days or weeks to complete. Parallel computing systems bring a significant benefit, namely high performance, to the processing of massive databases [33]. Parallel clustering is a methodology proposed to alleviate the problems of time and memory usage when clustering large amounts of data [94], [18].
Because of the huge amount of received spam emails, which increases vastly every hour (8 billion per hour) [110], [101], and because of the high variance that related emails may show due to the use of obfuscation techniques [108], it would be helpful to be able to parallelize the clustering process over several parallel computers. Parallel clustering speeds up the process of grouping unwanted messages into spam campaigns.
In the present thesis, we address all the aforementioned issues related to spam campaign detection, analysis, and labeling, as well as speeding up the process through parallelism with the use of formal methods. In what follows, the contributions of the thesis are explained in detail.
1.2 Main Contributions
The main contributions of this thesis can be summarized as follows:
— We propose a categorical clustering algorithm, named CCTree, designed to divide spam emails into smaller groups, named spam campaigns, based on structural similarity. The main hypothesis is that some parts of spam emails belonging to the same spam campaign remain unchanged. The CCTree has a tree-like structure, where the leaves of the tree represent the desired spam campaigns ([126]).
— A set of 21 categorical features is presented, which characterizes the structure of spam emails. An extensible and portable framework is provided to automatically extract the set of proposed features from raw emails. These features represent the structure of an email well, and some of them hardly change when a spammer creates his own spam campaign ([129]).
— We propose and validate, through the analysis of 200k spam emails, a methodology to choose the optimal CCTree configuration parameters. The proposed technique shows that once the input parameters of CCTree are chosen for a dataset, they can be used for similar datasets of comparable size ([129]).
— We show the effectiveness and efficiency of CCTree in clustering emails into campaigns through two well-known evaluation indexes: internal evaluation, i.e. the ability of CCTree to obtain homogeneous clusters, and external evaluation, i.e. the ability to effectively classify similar elements (emails) when classes are known beforehand ([129]).
— We propose a framework, named Digital Waste Sorter (DWS), which exploits a self-learning, spammer-goal-based approach for spam email classification. The proposed approach aims at automatically classifying large amounts of raw unclassified spam emails, dividing them into campaigns and labeling each campaign with its spammer's goal. To this end, we propose five class labels to group spammer goals into five macro-groups, namely Advertisement, Portal Redirection, Advanced Fee Fraud, Malware Distribution and Phishing ([128]).
— A ranking methodology is proposed to order sets of spam campaigns on the basis of investigator priorities. The proposed approach extracts five ranking features from each discovered spam campaign, according to investigator priorities. Together with the spammer-goal label of each spam campaign, these features are used to automatically attribute a grade to each spam campaign. The set of spam campaigns is then ordered by grade.
— A semiring-based formal method, named Feature-Cluster Algebra, is proposed to abstract the representation of CCTree. The resulting term equivalent to a CCTree is called a CCTree term. Through several theorems we prove that the proposed algebraic structure, under some conditions, fully abstracts the tree representation. A rewriting system is proposed to automatically verify whether a term is a CCTree term or not ([127]).
— The abstract schema of CCTree is applied to formalize CCTree parallelism. The parallelism approach can be applied to speed up the clustering process on parallel computers. To formalize CCTree parallelism, a set of rewriting rules is provided to obtain a final CCTree from the CCTrees produced by parallel computers. Through a set of examples and theorems, we show how the proposed approach works.
1.3 Thesis Outline
The present thesis is structured as follows. First, we provide a synthesis of related work on spam campaign detection, labeling, and formalization in Chapter 2.
In Chapter 3, we propose a categorical clustering algorithm, named CCTree, to cluster spam emails based on structural similarity (step 1 in Figure 1.3); the result of this step is a set of spam campaigns, which are the leaves of the CCTree (step 2 of Figure 1.3).
The effectiveness and efficiency of CCTree in spam campaign detection is presented in Chapter 4.
We propose a self-learning approach to label spam campaigns on the basis of the goal of the spammers (steps 3 and 4 of Figure 1.3), and rank the labeled spam campaigns (step 5 of Figure 1.3) in Chapter 5.
The aforementioned steps suffice to divide a large amount of spam emails into spam campaigns. On the other side, to speed up clustering algorithms, one well-known technique is parallel clustering. In the rest of the thesis, we formalize CCTree parallelism; hence, the whole dataset can be divided among parallel computers (steps 6 and 7 of Figure 1.3). In Chapter 6, we abstract the CCTree representation with the use of a well-known algebraic structure, named semiring. We prove that the proposed algebra-based technique abstracts the tree representation. The formal representation of a CCTree is named a CCTree term. We propose a rewriting system to verify whether a term is a CCTree term or not. The CCTree term is then used to formalize CCTree parallelism with the use of a rewriting system (step 8 of Figure 1.3). The final CCTree yields the set of spam campaigns (step 10 of Figure 1.3), which can be delivered to the previously explained parts of the framework to be labeled and ranked. We conclude with future directions of the present thesis in Chapter 7.
Figure 1.3 – The framework of thesis.
Chapter 2
State of the Art
In line with the growing concern regarding spam messages, an increasing number of works has been dedicated to the problem, studying the issue from different aspects. In this chapter, we present a comprehensive literature review of the problem of spam emails, directly or indirectly related to our work. At the end of the chapter, we present the studies related to formal methods applied to the presentation of feature models. We discuss how these formal approaches are similar to (and different from) our proposed semiring-based formalization technique for abstracting a feature-based categorical clustering algorithm and, finally, for speeding up the clustering process through parallelism.
2.1 Spam Emails Issues
In this section we explain different problems of spam emails discussed in the literature.
Botnets are one of the main topics related to spam emails, and they have come vastly under consideration in recent years. [76] report that more than 85% of worldwide spam is sent by botnets 1. The term botnet refers to a group of compromised host computers that are controlled by a small number of commander hosts referred to as command and control (C&C) servers. Compromised machines on the Internet are generally referred to as bots, and the set of bots controlled by a single entity is called a botnet [153]. In other words, a botnet is a network of “zombie” computers infected by a malicious software (or “malware”) designed to enslave them to a master computer. The malware is installed in a variety of ways, such as by downloading an attachment received in a spam email [25], [78], [35].
1. www.symantec.com
[146] perform a large-scale analysis of spamming botnet characteristics and identify trends that can benefit future botnet detection and defense mechanisms. The proposed framework is based on the premise that botnet spam emails are mostly sent in an aggregate fashion, resulting in content prevalence similar to worm propagation. The focus of the research is on URLs embedded in email content. Using three months of spam emails collected from Hotmail, the proposed framework, named AutoRE, found several interesting results regarding the degree of email obfuscation, properties of botnet IP addresses, sending patterns, and their correlation with network scanning traffic [146].
[79] present a platform, named Botlab, which continually monitors and analyzes the behavior of spam botnets. The results of this study show that six botnets are responsible for 79% of the spam messages arriving at the University of Washington campus.
[96] first discuss the fundamental concepts of botnets, including formation and exploitation, lifecycle, and two major kinds of topologies. Several related attacks, as well as detection, tracing, and countermeasure techniques, are introduced afterwards.
[47] propose a spam zombie detection system, named SPOT (Sequential Probability Ratio Test), which monitors the outgoing messages of a network. Through a two-month email trace collected in a large US campus network, they show that SPOT is an effective and efficient technique for automatically detecting compromised machines in a network.
[52] apply the PageRank approach, with an additional clustering algorithm, to efficiently detect stealthy botnets that communicate peer-to-peer.
[133] provide interesting statistics about botnets: after two hours, about 29.6% of bots are blacklisted, and 46.4% are blacklisted after three hours. By six hours, roughly 75.3% are blacklisted. The rate reaches 90% after a period of about 18 hours.
[142], [149], [45] propose several approaches to find the botmaster through stepping stones.
[13], [122], [116], [107] provide a brief look at existing botnet research, the evolution and future of botnets, as well as the goals and visibility of today's networks, in order to inform the field of botnet technology and defense.
Another topic related to the problem of spam emails concerns the cost of spam messages and the revenue of spammers.
[119] believe that any marketing based on spam emails has the advantage of costing the sender little; hence, the sender sends a large number of messages to maximize the return.
Several studies focus on what spammers get back from spam campaigns. The conversion rate of spam marketing is discussed in [83], while in [133], [112], and [134] the underground economy of spam is analyzed. [133] show that spam-as-a-service can be purchased for approximately $100–$500 per million emails sent. Botnets can also be rented by groups interested in sending out larger amounts of designed spam emails; such botnets are capable of sending 100 million emails per day for $10,000 per month. Considering, in their own study, that Cutwail operators may have paid between $1,500 and $15,000 on a recurring basis to grow and maintain their botnet, and estimating the value of the largest email address list (containing more than 1,596,093,833 unique addresses) from advertised prices, it is worth approximately $10,000–$20,000. Finally, the Cutwail gang's profit for providing spam services is estimated at around $1.7 million to $4.2 million since June 2009. They also observed that several individuals offer 10,000 malware installations for approximately $300–$800, and rates for one million email addresses range from $25 to $50, with discounted prices for bulk purchases.
[84] show that a successful spam campaign can earn revenues between $400k and $1000k.
The other side of the cost of spam has been evaluated as productivity cost 2. To measure the cost of spam emails in terms of productivity, suppose that the average salary of an employee is $80k per year, for 220 working days per year. Say the employee receives 100 messages per day, of which 40 are spam, and that the average time to read and delete a message is 5 seconds. The employee then earns about $45 per hour and needs about 3 minutes per day just for deleting spam emails, losing $2.25 per day on checking spam messages. This means that a company with 1000 employees loses about $500,000 per year in productivity costs due to spam messages.
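This back-of-the-envelope calculation can be checked directly; the sketch below reproduces the rounded arithmetic above, assuming an eight-hour working day (an assumption the figures leave implicit).

```python
# Reproducing the rounded productivity-cost arithmetic above
# (an 8-hour working day is assumed).
hourly_wage = 80_000 / 220 / 8          # ~= $45.45 per hour
minutes_on_spam = 40 * 5 / 60           # ~= 3.3 minutes per day deleting spam
daily_cost = 45 / 60 * 3                # = $2.25/day, using the rounded figures
yearly_cost = daily_cost * 220 * 1000   # = $495,000 ~= $500k for 1000 employees
print(hourly_wage, minutes_on_spam, daily_cost, yearly_cost)
```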
The other main focus of research related to the problem of spam emails concerns spam filtering methods.
Spam filtering is based on the analysis of message contents and additional information, trying to distinguish spam messages from legitimate ones [143], [21]. Generally, a spam filter is an application which implements a function of the following form:

\[
f(m, \theta) =
\begin{cases}
C(spam) & \text{if the message } m \text{ is spam} \\
C(leg) & \text{if the message } m \text{ is legitimate}
\end{cases}
\]

where m is a message to be classified, θ is a vector of parameters, and C(spam) and C(leg) are the labels assigned to the message.
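As a toy illustration of this decision function, the sketch below reduces the parameter vector θ to a hypothetical keyword set and a score threshold; real filters learn θ with the machine-learning techniques cited next.

```python
# Toy instance of the filtering function f(m, θ); here θ is reduced to a
# hypothetical keyword set plus a score threshold. Real filters learn θ
# from data (e.g. with Naive Bayes).
SPAM_WORDS = {"viagra", "lottery", "winner", "free"}  # θ, part 1 (hypothetical)
THRESHOLD = 2                                         # θ, part 2 (hypothetical)

def f(message: str) -> str:
    score = sum(1 for word in message.lower().split() if word in SPAM_WORDS)
    return "C(spam)" if score >= THRESHOLD else "C(leg)"

print(f("You are the lottery winner of a free cruise"))  # -> C(spam)
```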
Spam filtering is mostly performed with the use of machine learning algorithms, e.g. Naive Bayesian approaches [9], [8] and other classifiers [75], [151], [90], [22], [138], [139]. The approaches proposed in the literature for filtering spam emails cover a variety of topics. [29] present an overview of approaches aimed at spam filtering. Text analysis, characterizing spam emails through the use of special words, is another applicable approach in the field of spam filtering. To this end, [48] apply lazy learning algorithms to tackle concept drift in spam filtering, while [80] use n-grams in a word-based anti-spam approach. Spammers started to obfuscate the text in spam messages, or to embed the text in images, to avoid being identified through text filtering techniques. Image spam filtering methodologies [10], [20] came under consideration to block these kinds of spam messages.
Nevertheless, despite the growing research on spam filtering, often showing an accuracy above 90% [21], the evolution of spam messages is still considerable. Actually, a filter prevents end users from wasting their time on junk messages, but it does not stop the misuse of resources, since the messages are delivered anyway [21].
We believe the reason could be that the spammer, the root of the problem, feels that there is minimal risk of being caught.
2. http://www.fixedbyvonnie.com/2013/08/what-is-spam-and-how-you-get-junk-email/
To effectively fight the problem of spam emails, it is mandatory to find and prosecute the spammers, who generally hide behind complex networks of infected devices which send spam emails against their users' will, i.e. botnets. Due to botnets, identifying the spammer is a difficult task, yet a possible one [142], [149], [45]. To this end, it is first required to divide huge amounts of spam emails efficiently and effectively, in a way that helps catch the spammer.
2.2 Clustering Spam Emails into Campaigns
Detecting a spammer, analyzing his behavior, and deciding which spammers have priority to be followed constitute an extremely challenging task, due to the huge amount of spam emails, which increases vastly every hour (8 billion per hour) [110], [101], and due to the high variance that related emails may show because of the use of obfuscation techniques [108]. To this end:
• First of all, a fast and effective clustering algorithm is required to divide a huge amount of spam messages into smaller groups, each representing a spam campaign originating from the same source (spammer).
In the research field of spam emails, several works exist that cluster spam emails into spam campaigns.
The basic idea in [87] for identifying spam campaigns is based on keywords or strings standing for specific types of campaigns. For example, all templates containing the string linksh are defined as a type of self-propagation campaign. Several campaign types, related to the same spammer purpose, constitute a campaign class. The purpose of a spam campaign is identified on the basis of keywords in the text or subject. The set of messages containing no text, and just the feature, belongs to the image campaign. Finally, 10 spam campaign classes are presented, named 1) Image spam, 2) Job ads, 3) Other ads, 4) Personal ads, containing fake dating/matchmaking advance money scams, 5) Pharma, containing pointers to web sites selling Viagra, Cialis, etc., 6) Phishing, which forces victims to enter sensitive information, 7) Political campaigning, 8) Self-prop, i.e. the spam messages which trick victims into executing Storm binaries, 9) Stock scam, which tricks victims into buying a particular penny stock, and 10) Other.
Manual selection of keywords needs too much iterative effort, while spammers change the keywords they use very quickly. Moreover, spammers continuously fight keyword-based approaches by means of obfuscation techniques.
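A minimal sketch of such keyword-based campaign typing (with hypothetical keyword sets) makes this fragility visible: a single obfuscated character already escapes the rule.

```python
# Sketch of keyword-based campaign typing in the spirit of [87]; the
# keyword sets are hypothetical. A single obfuscated character ("v1agra")
# already escapes the rules.
CAMPAIGN_KEYWORDS = {
    "Pharma": {"viagra", "cialis"},
    "Self-prop": {"linksh"},
    "Stock scam": {"penny", "stock"},
}

def campaign_class(text: str) -> str:
    words = set(text.lower().split())
    for label, keywords in CAMPAIGN_KEYWORDS.items():
        if words & keywords:
            return label
    return "Other"

print(campaign_class("Cheap viagra here"))   # -> Pharma
print(campaign_class("Cheap v1agra here"))   # -> Other (obfuscation wins)
```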
It has been inferred by [87] that 65 percent of campaign instances last less than 2 hours; the longest-lasting ones are pharmaceutical campaigns, which were available for months, and the crucial self-propagation campaigns, which worked for 12 days.
Three large campaigns, named Pharma, self-propagation, and stock storm, have a large number of unique headers in their templates, but Pharma and self-propagation actually have few different bodies. The authors suggest that it may be better to focus clustering on headers to identify these three campaigns, and then try to identify other campaigns using other techniques.
In [54], although the authors focus on the analysis of spam URLs in Facebook, the study of URLs and the clustering of spam messages are similar to our goal concerning spam emails. First, all wall posts that share the same URLs are clustered together; then the descriptions of wall posts are analyzed, and if two wall posts have the same description, their clusters are merged. In this study, factors like bursty activity and distributed communication have also come under consideration.
The distributed property in sending spam emails refers to the number of users who send spam messages in the cluster; it is usually computed from the IP addresses of the senders, while for Facebook spam messages it refers to users' unique IDs.
The bursty property comes from the rationale that most spam campaigns are involved in an action within a short period of time.
The threshold values for the distributed and bursty properties in this study have been identified as 5 and 1.5 hours, respectively. This means that if a spammer sends spam messages to fewer than 5 different accounts, or the interval between messages is greater than 1.5 hours, he is considered a person who has no important effect on the system.
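The two-step grouping of [54] can be sketched with a union-find structure over a hypothetical post model (a set of URLs plus a description string): posts sharing a URL are united first, then posts sharing a description.

```python
# Sketch of the two-step clustering of [54]: posts sharing a URL join the
# same cluster, and clusters whose posts share a description are merged.
# The post model (set of URLs + description string) is hypothetical.
from collections import defaultdict

def cluster_posts(posts):  # posts: list of (set_of_urls, description)
    parent = list(range(len(posts)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    by_url, by_desc = defaultdict(list), defaultdict(list)
    for i, (urls, desc) in enumerate(posts):
        for url in urls:
            by_url[url].append(i)
        by_desc[desc].append(i)
    # Union every group that shares a URL (step 1) or a description (step 2).
    for group in list(by_url.values()) + list(by_desc.values()):
        for i in group[1:]:
            parent[find(i)] = find(group[0])
    clusters = defaultdict(list)
    for i in range(len(posts)):
        clusters[find(i)].append(i)
    return list(clusters.values())

print(cluster_posts([({"u1"}, "win!"), ({"u2"}, "win!"), ({"u3"}, "hey")]))
# -> [[0, 1], [2]]
```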
Furthermore, the authors found that the techniques spammers use for attracting people's attention can mostly (88.2%) be classified into three types: 1) they promise free gifts, 2) they use phrases that trigger curiosity, e.g. that someone likes them, 3) they describe a product for sale. It has been discovered that approximately 80 percent of malicious accounts are active for less than one hour, and about 10 percent are active for longer than one day. In each time zone, most malicious wall posts were sent around 3 am to avoid detection; among 187 million wall posts of 3.5 million Facebook users, 200,000 malicious wall posts were attributed to 57,000 malicious accounts.
[92] believe that spam emails with identical URLs are highly clusterable and are mostly sent in bursts. In their method, if the same URL exists in spam emails from source A and source B, each with a unique IP address, the two sources are connected by an edge, and the connected components of the resulting graph are the desired clusters. It is also observed that if a spammer is associated with multiple groups, he has a higher probability of sending more spam emails in the near future. Furthermore, the authors found that a very small fraction of the active spammers actually accounted for a large portion of the total spam emails, and they inferred that spam emails from the same group of spammers are sent in bursts.
Spamscatter [4] is a method that automatically clusters the destination web sites extracted from URLs in spam emails with the use of image shingling. In image shingling, images are divided into blocks and the blocks are hashed; two images are considered similar if 70 percent of the hashed blocks are the same. The lifetime of each detected spam campaign is computed by finding the first and the last (in terms of time) spam message in the campaign. The results show that over 40% of malicious scams persist for less than 120 hours, whereas the lifetime of the same percentage of shopping scams is 180 hours, and the median for all scams is 155 hours.
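Image shingling can be sketched in a few lines: hash fixed-size blocks of each image and declare two images similar when at least 70 percent of their block hashes coincide. The byte-level blocking and the block size below are simplifying assumptions (the original operates on image regions).

```python
# Sketch of image shingling as used by Spamscatter [4]: hash fixed-size
# blocks and compare hash sets; 70% overlap means "similar". Byte-level
# blocks and the block size are simplifying assumptions.
import hashlib

def shingles(image_bytes: bytes, block_size: int = 64) -> set:
    return {hashlib.md5(image_bytes[i:i + block_size]).hexdigest()
            for i in range(0, len(image_bytes), block_size)}

def visually_similar(a: bytes, b: bytes, threshold: float = 0.7) -> bool:
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return False
    return len(sa & sb) / min(len(sa), len(sb)) >= threshold
```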
[150] cluster spam messages based on the images they contain, in order to trace the origins of spam emails. To this end, spam images are divided into two parts: foreground and background, where the foreground comprises the text and/or illustrations and the background consists of colors and/or textures. Spam emails are visually similar if their illustrations, text, layouts, and/or background textures are similar. A two-stage clustering is applied: first, Optical Character Recognition recognizes texts whose bounding boxes represent the text layout; afterwards, the illustrations are separated from the background by detecting the background. The authors mention that the proposed approach needs to be combined with other methods to obtain better results.
[130] focus on clustering spam emails based on the IP addresses resolved from the URLs inside the body of these emails. The rationale behind this is that, in the authors' view, in many cases it is not easy to change IP addresses, since doing so requires compromising a lot of computers. In this study, two emails belong to the same cluster if the IP addresses resolved from their URLs are exactly the same. Afterwards, the relationship between spam sending systems and the malicious web servers connected to the URLs, as well as information such as the number of unique URLs, unique domain names, etc., is provided.
By examining three weeks of spam messages gathered on the SMTP server they used, the authors conclude that the proposed methodology outperforms clustering techniques based on domain names and URLs. The claim is justified by the fact that the domain names associated with a scam change frequently, that the period during which a URL is active is too short for performing the investigation, and that most of the time the URLs used in spam emails are unique.
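The grouping criterion itself fits in a few lines: two emails share a cluster exactly when the sets of IP addresses resolved from their URLs coincide, as in this sketch (DNS resolution is assumed to have been done beforehand).

```python
# Sketch of the clustering criterion of [130]: emails whose URL-resolved
# IP sets are identical share a cluster; DNS resolution is assumed done.
from collections import defaultdict

def cluster_by_resolved_ips(resolved):  # resolved: list of sets of IP strings
    clusters = defaultdict(list)
    for idx, ips in enumerate(resolved):
        clusters[frozenset(ips)].append(idx)
    return list(clusters.values())

print(cluster_by_resolved_ips([{"1.2.3.4"}, {"1.2.3.4"}, {"5.6.7.8"}]))
# -> [[0, 1], [2]]
```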
In all the aforementioned works for clustering spam emails into campaigns, the pairwise comparison of every two emails is required, so the time complexity is quadratic. Furthermore, spam campaign detection is limited to one or two features of the spam emails: if a spam message does not contain the related feature, the methodology fails to cluster it. For example, for emails without URLs or without images, the approaches of [130] and [150], respectively, fail.
Other limitations of the former approaches have been identified in [132], which shows how considering only IP addresses resolved from URLs is insufficient for dividing emails into spam campaigns. More precisely, since web servers host lots of domains with the same IP address, every spam campaign identified by such means (such as [130]) is instead made of a large amount of spam emails sent by different controlling entities.
Thus, [130] propose a new technique for spam campaign detection, named O-means clustering, which is based on the K-means clustering algorithm. The distance between two spam messages is calculated based on 12 features extracted from the emails, which are expressed as numbers, and the distance is computed with the Euclidean measure. The set of 12 features is: 1) size of email, 2) number of lines, 3) number of unique URLs, 4) average length of unique URLs, 5) average length of domain names, 6) average length of query, 7) average number of key-value pairs, 8) average length of path, 9) average length of keys, 10) average length of values, 11) average number of dots in domains, 12) number of global top-100 URLs.
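Under this representation, the distance between two emails is plain Euclidean distance over 12-dimensional numeric vectors, as sketched below (feature extraction is assumed done elsewhere); the example also previews the numeric-treatment issue discussed next.

```python
# Sketch of the O-means distance: each email is a 12-dimensional numeric
# feature vector, compared with Euclidean distance. Feature extraction is
# assumed to happen elsewhere.
import math

def o_means_distance(a, b):  # a, b: length-12 lists of numbers
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

one_link  = [1]  + [0] * 11   # email with 1 link (other features zeroed)
ten_links = [10] + [0] * 11
print(o_means_distance(one_link, ten_links))          # 9.0
print(o_means_distance(ten_links, [11] + [0] * 11))   # 1.0
```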
A limitation of O-means is that it requires the number of clusters to be known from the beginning, which is generally not a working hypothesis. On the other hand, the applied features are treated as numerical, which does not represent reality well, especially when the distance between two emails is based on the number of links taken numerically: an email with one link is then considered closer to an email with 10 links than the latter is to one with 11 links.
After clustering spam emails according to the O-means method, [131] found that the 10 largest clusters had sent about 90 percent of all spam emails. Hence, the authors investigated these 10 clusters and performed a heuristic analysis for selecting significant features among the 12 features used in the previous work. As a result, they selected the four most important features, which could effectively separate these 10 clusters from each other: “size of emails”, “number of lines”, “length of URLs” and “number of dots”. However, the authors mention that this is not the best method for selecting the most significant features, since it was based on the analysis of the top 10 clusters only. Nevertheless, it yields almost the same clustering accuracy as the previous method, which used 12 features: the accuracy goes from 86.63 percent to 86.33 percent, a negligible difference, while the execution time decreases from 28,772 seconds to 6,124 seconds.
[144] first extract eleven features of each spam email. This set of features includes : “Message
Id”, “Sender IP address”, “Sender Email”, “Subject”, “Body Length”, “Word Count”, “Attachment File Name”, “Attachment MD5”, “Attachment Size”, “Body URL”, “Body URL Domain”,
while some attributes are broken down into two sub-attributes, for example, “body URL” into
“Machine Name” and “Path”.
Afterwards, two clustering algorithms are applied to divide the spam messages. First, an agglomerative hierarchical algorithm [66] is used to cluster the whole dataset by comparing message subjects. This means that at the beginning each email is a cluster by itself, and clusters sharing a common subject are then merged. The distance D(i, j) between two clusters i and j equals 0 if they share a common feature of an attribute, and 1 otherwise; whenever the distance between two clusters is 0, they are merged. The authors found that, with this first merge based on the subject, 67% of the messages end up in a single cluster. To reduce the false positive rate of such big clusters, the connected components with weighted edges algorithm is applied. A connected component [12] of an undirected graph is a set A of vertexes such that, for each vertex v ∈ A, the set of vertexes reachable by a path from v is exactly A. The weight on an edge represents the strength of the connection between two vertexes. In this approach, edges connect two spam emails based on the eleven attributes; the desired clusters are the connected components of the graph whose edge weights are above a specified threshold.
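A minimal sketch of this second step is given below; the attribute-overlap weight and all names are our own reconstruction, under the assumption that an edge is kept when two emails agree on at least `threshold` of the eleven attributes:

```python
from collections import defaultdict

# Sketch: emails are vertexes; an edge weight counts how many attribute
# values two emails share, and only edges weighing >= threshold are kept.
# The clusters are the connected components of the resulting graph.
def campaigns(emails, threshold):
    n = len(emails)
    graph = defaultdict(set)
    for i in range(n):
        for j in range(i + 1, n):       # the quadratic pairwise comparison
            weight = sum(a == b for a, b in zip(emails[i], emails[j]))
            if weight >= threshold:
                graph[i].add(j)
                graph[j].add(i)
    seen, components = set(), []
    for v in range(n):                  # depth-first search per component
        if v in seen:
            continue
        stack, comp = [v], []
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            comp.append(u)
            stack.extend(graph[u] - seen)
        components.append(comp)
    return components

emails = [("en", "html", "x"), ("en", "html", "x"), ("fr", "text", "y")]
print(campaigns(emails, threshold=2))   # [[0, 1], [2]]
```

The double loop over all pairs makes the quadratic cost of this family of methods explicit, which is precisely the drawback discussed next.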
The main drawback of this methodology is that it cannot be applied to large datasets, since pairwise comparisons between the emails of the dataset are performed repeatedly.
The basic hypothesis in [27] for clustering spam emails is that some parts of spam messages remain static, which makes a spam campaign recognizable. In this work, as an improvement over [92], URLs are not the only basis for clustering: to identify spam campaigns, several features are extracted from the spam emails, namely “language of email”, “message layout”, “type of message”, “URLs” and “subject”. Afterwards, the frequencies of the proposed features in a large dataset are computed in order to cluster the spam messages with the use of an FP-Tree. The Frequent Pattern Tree (FP-Tree), proposed by [67], is a signature-based method in which each node below the root represents a feature, extracted from the spam messages, that is shared by the subtrees beneath it. Thus, each path in this tree shows a set of features that co-occur in messages, with the property of non-increasing frequency of occurrence.
Applying the FP-Tree to spam campaign detection, as in [27] and [44], has several limitations. First of all, on the side of URL similarity, since each token of a URL is considered as a feature, the method fails to recognize dynamic URLs in emails belonging to the same campaign [27]. Moreover, considering URL tokens as features causes a spam email containing several URLs to be directed to several campaigns.
Moreover, on the side of layout detection, the FP-Tree is overly sensitive to very small changes in the layout. More precisely, the FP-Tree reads each message line by line, and the layout is encoded as a string of letters, e.g. UTBUUB, where the i-th letter of the string represents the i-th line of the spam message; e.g., if U occurs as the first letter of the layout string, the first line of the message contains a URL. Considering that spammers use several techniques for random text and URL obfuscation, two very similar emails belonging to the same spam campaign may be assigned two different layouts by the FP-Tree, just because the random text wraps onto the next line in one email but not in the other.
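The encoding just described can be illustrated with a few lines of code; the letter set and the URL test below are our own guesses for illustration, not the exact rules of [27]:

```python
import re

# Sketch of the line-by-line layout encoding: each line of the message
# becomes one letter (U = contains a URL, B = blank, T = other text),
# so a whole message maps to a string such as "UTBUUB".
def layout(message: str) -> str:
    letters = []
    for line in message.splitlines():
        if not line.strip():
            letters.append("B")
        elif re.search(r"https?://", line):
            letters.append("U")
        else:
            letters.append("T")
    return "".join(letters)

a = "Buy now!\nhttp://x.example\n\nrandom filler text"
b = "Buy now!\nhttp://x.example\n\nrandom filler\ntext"   # one extra wrap
print(layout(a), layout(b))   # TUBT TUBTT: same email, different layouts
```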
In summary, the previous works on clustering spam emails can be divided into two main categories: the first group focuses on the pairwise comparison of emails, for example URL comparison, while in the second group a clustering algorithm is used, for example O-means clustering. In general, the aforementioned works suffer from at least one of the following problems: 1) they consider only one or two features for grouping spam messages, which decreases the accuracy; 2) pairwise comparison is used, with quadratic time complexity; 3) the number of clusters is required as prior knowledge; 4) the features which create a pure cluster are not singled out. Our proposed methodology for clustering spam emails into campaigns tries to address all the aforementioned problems.
2.3
Labeling and Ranking Spam Campaigns
• In the next step, to address the spam message problem, an approach is required to label the detected spam campaigns, in order to train a classifier with the labeled set of messages, and then to impose an order on the detected spam campaigns according to investigator priorities.
In the literature, spam campaigns are usually labeled based on characteristic strings (keywords) representing individual campaign types, as in [44], [88] and [55]. In these works, the occurrence of a specific string in a spam message means that the spam is labeled as a pre-identified campaign type; for example, all templates containing the string linksh are defined as a type of self-propagation campaign. First of all, manual string selection requires a lot of time, while spammers frequently change the set of words in the body of messages by applying obfuscation techniques. Moreover, it is worth noticing that many spammers apply the same words, like “viagra”, to deceive the victims. Hence, training a classifier on keyword-based labels is not helpful for spam campaign detection when, as in our case, a spam campaign is defined as the set of messages originated from the same source.
[106] labels spam campaigns on the basis of the contact information in the body of messages. To this end, URLs, phone numbers, Skype IDs, and Mail IDs used as contact information are considered for clustering spam emails into similar groups, and the contact information serves as the label of the detected spam campaign. This methodology is effective only for emails reporting contacts, which are only a subset of all the spam emails found in the wild.
There are several approaches in the literature in which the spammer's goal is considered. However, these approaches mainly focus on detecting phishing emails, without considering other spammer purposes. A phishing email [3], as a special type of spam message, has become an enormous threat to all Internet-based commercial operations, causing non-negligible financial losses to organizations and individual users. The phisher attempts to redirect users to fake websites designed to illegally obtain a person's financial data, such as usernames, passwords, credit card details, etc., in an electronic communication [3].
In this regard, typically a set of features representing the structure of a phishing email is proposed, and then a machine learning algorithm is used to classify a set of emails as phishing or legitimate.
[50] applies 10 email features to discern phishing emails from ham (good) emails. These 10 features are: 1) IP-based URLs, 2) age of linked-to domain names, 3) nonmatching URLs, 4) “Here” links to a non-modal domain, 5) HTML emails, 6) number of links, 7) number of domains, 8) number of dots, 9) presence of JavaScript, 10) spam-filter output.
[17] proposes a similar methodology with additional features to train a classifier for filtering phishing emails. Advanced email features are generated by adaptively trained Dynamic Markov Chains and latent Class-Topic Models. The features are divided into three main groups: basic features, dynamic Markov chain features, and latent topic model features. The basic features themselves contain several subgroups, e.g. structural features and link features.
[34] proposes a methodology to detect phishing emails based on both machine learning and heuristics; the proposed heuristic anti-phishing system employs Gestalt and decision theory concepts to model similarity. [3] provides a survey of the different techniques for filtering phishing emails, while Gansterer et al. [53] compare different machine learning algorithms for phishing detection. Furthermore, the latter authors propose a technique which refines previous phishing filtering approaches, distinguishing three types of messages automatically: ham, spam and phishing. Nevertheless, the category of emails containing spam is not precisely characterized.
There are a number of works discussing different aspects of spam email attacks, spanning from the network of malware distribution [104] and PageRank spam analysis [1] to the total revenues of a range of spam-advertised campaigns [84], [83]. However, these works also analyze some specific aspect of one type of spam attack, and the detection of different types of spam attacks is not discussed.
On the side of ranking spam campaigns, [44] considers elements relevant to Canadian law enforcement, e.g. Canadian IP addresses, “.ca” top-level domain names, and ranges of Canadian IP addresses.
To the best of our knowledge, the present work is the first effort to label spam campaigns according to the different goals of the spammer, inferred from the structural features of the messages, where the goal-based label of each campaign is then used to order the set of detected labeled spam campaigns.
2.4
On the Formalization of Clustering and its Applications
• As the next step, we formalize CCTree, the proposed effective and efficient categorical clustering algorithm. The formal schema is used to formalize CCTree parallelism with the use of a rewriting system.
Studies on the formalization of the different concepts related to clustering algorithms are hard to find in the literature.
[58] formalizes hierarchical clustering as an Integer Linear Programming (ILP) problem with a natural objective function, where the dendrogram properties of hierarchical clustering are enforced as linear constraints. The proposed formalization technique has the benefit that relaxing the constraints may provide novel program variations, like overlapping clusterings.
[103] formally defines the problem of clustering in Multi-Criteria Decision Aid (MCDA) systems. As in most MCDA approaches, the preferences of a decision maker are modeled based on a set of decision alternatives. To find the optimal solution, the authors propose a heuristic approach, which is validated through tests on a large set of artificially generated benchmarks.
[2] proposes an approach to formalize the problem of data streams in clustering algorithms, based on set theory; a data stream is an infinite sequence of data. The formalization scheme makes it possible to identify and propose basic properties for the design and comparison of data stream clustering algorithms. To this end, the authors extend Kleinberg's properties [86] to represent clustering partitions evolving according to the data stream behavior. They find that it is difficult for an algorithm to comply with the expressiveness property in a data stream context.
[41] applies a predicate logic language, in terms of sets of if-then rules, to formalize heuristic rules in clustering algorithms. In this approach it is possible to describe traditional clustering algorithms, like k-means. However, in none of the few works on formalizing clustering algorithms is an algebraic methodology used to abstract the representation of a clustering algorithm.
In what follows we present several techniques and methodologies used to formalize feature
models.
Feature models are information models in which a set of products, e.g. software products or DVD player products, is represented as a hierarchical arrangement of features, with different relationships among the features [15]. Feature models are used in many applications, as they are able to model complex systems, are interpretable, and can handle both ordered and unordered features [105]. Benavides et al. [15] believe that designing a family of software systems in terms of features makes it easy to understand for all stakeholders, more so than when it is expressed in terms of objects or classes. Representing feature models as a tree of features was first introduced by Kang et al. in [82], to be used in software product lines. Some studies [31], [32] show that tree models combined with ensemble techniques lead to accurate performance on a variety of domains. In a feature model tree, differently from a CCTree, the root is the desired product, the nodes are the features, and different representations of the edges demonstrate the mandatory or optional presence of features.
Höfner et al. [73], [74] were the first to apply an idempotent semiring as the basis for the formalization of tree models of products, and they called it feature algebra. The concept of semiring is used to answer the needs of product families: an abstract form of expression, refinements, multi-view reconciliation, and product development and classification. The elements of the semiring in the proposed methodology are sets of products, or product families.
To get better insight into how feature algebra works, we present a brief history of product families, from definition to formalization. Furthermore, we explain that, despite our inspiration from the concept of feature algebra in formalizing a tree-model system, our proposed approach differs in several aspects.
FODA used feature models as the means to express the mandatory, optional and alternative concepts within a domain [81], [115]. For example, in a car the transmission system is a mandatory feature and air conditioning is an optional feature, whilst the transmission system can be either manual or automatic. The part of the FODA feature model most relevant to formalization works is the proposed feature diagram: it builds a tree of features and captures the mandatory, optional, and alternative relationships among features.
[82] performs an analysis of the commonalities among applications in a particular domain in terms of services, operating environments, domain technologies and implementation techniques. Afterwards, a model named feature model is constructed to capture the commonalities as an AND/OR graph, where the AND nodes demonstrate mandatory features and the OR nodes show alternative features chosen from different applications.
[39] proposes a feature model represented by a hierarchically arranged diagram, where a parent feature is composed of a combination of some or all of its children. A parent feature vertex and its children can have one of the following relationships (see the sketch after this list):
– And relationship, which indicates that all children must be considered in the composition of the parent feature;
– Alternative relationship, which indicates that only one child forms the parent feature;
– Or relationship, which shows that one or more child features can be involved in the composition of the parent feature;
– Mandatory relationship, which indicates that the child features are required;
– Optional relationship, which shows that the child features are optional.
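As a toy illustration of such a diagram (our own encoding, not the notation of [39]), each node can carry the relationship kind of the edge leading to it, reusing the FODA car example above:

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative encoding of a feature-diagram node; `relation` names the
# relationship on the edge from the parent, using the five kinds above.
@dataclass
class Feature:
    name: str
    relation: str = "mandatory"   # and | alternative | or | mandatory | optional
    children: List["Feature"] = field(default_factory=list)

car = Feature("car", children=[
    Feature("transmission", relation="mandatory", children=[
        Feature("manual", relation="alternative"),
        Feature("automatic", relation="alternative"),
    ]),
    Feature("air_conditioning", relation="optional"),
])
```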
Lopez-Herrejon, Batory, and Lengauer model features as functions and feature composition as function composition [97], [95].
To get better insight into how feature algebra works, we refer to an example of a product line provided in [24]. Suppose that an electronics company has a family of three product lines: mp3 Players, DVD Players and Hard Disk Recorders. All members share the set of features given in the commonalities. A member can contain some mandatory features and might contain some optional features that another member of the same product line does not have. For instance, one product could be a DVD Player able to play music CDs, whilst another one lacks this feature. However, all the DVD players of the DVD Player product line must contain the Play DVD feature. Furthermore, it is possible to have a DVD player able to play several DVDs at the same time.
Different researchers have proposed different views of what a feature is or should be. A definition common to most (if not all) of them in Feature-Oriented Software Development (FOSD) is that “a feature is a structure that extends and modifies the structure of a given program in order to satisfy a stakeholder's requirement, to implement a design decision, and to offer a configuration option” [72].
Usually, a set of features is composed to create a final program, which is itself considered a feature. Under this assumption, a feature is either a complete program which can be executed, or a program increment that requires further features to lead to a complete program. The structure of a basic feature is modeled as a tree, called a feature structure tree (FST), which organizes the feature's structural elements, e.g. classes, fields, or methods, hierarchically. A name and type information are assigned to each node of an FST, which helps to prevent the composition of incompatible nodes during feature composition [72].
The concept of product families entered the software development process from the hardware industry [72]. The reason is that software developers, too, prefer not to build just a single product but a family of similar products, sharing some functionalities whilst having some well-identified variabilities. These elements, known as features, can in a software family be characterized as requirements, architectural properties, components, middleware, or code. Due to the fact that the systems are characterized by their features, in [72] the authors call their proposed methodology feature algebra. Idempotent semirings are the basis of feature algebra, which allows a formal treatment of the aforementioned elements, as well as calculations with them. Sets of products are particular models of the proposed feature algebra, which in its extended form covers product lines, refinement, product development and product classification.
The tree-like structure formalized in product family problems differs from the CCTree structure. In a product family structure, in contrast to a CCTree, the edges of the tree have no labels; only the nodes do. Furthermore, different representations of the edges convey different concepts, whilst in a CCTree there are no different possible edge representations.
To the best of our knowledge, we are the first to apply an algebraic structure to abstract the representation of a categorical clustering algorithm and to formalize interesting concepts related to it, i.e. clustering parallelism. To this end, we attribute an algebraic representation to a tree structure, and then, through several theorems and examples, we show that the proposed algebraic term fully abstracts the tree representation. Calling the term resulting from a CCTree a CCTree term, a rewriting system is proposed to automatically verify whether a term represents a CCTree structure or not. Furthermore, a set of rewriting rules is provided to formalize the parallelization of the CCTree.
Chapitre 3
Spam Campaign Detection
Spam emails constitute a fast-growing and costly problem associated with the Internet today. To fight effectively against spammers, it is not enough to block spam messages; instead, it is necessary to analyze the behavior of spammers and, where possible, catch them. This analysis is extremely difficult if the huge amount of spam messages is considered as a whole. Clustering spam emails into smaller groups according to their inherent similarity facilitates the discovery of the spam campaigns sent by a spammer, in order to analyze the spammer's behavior. In this chapter, we propose a methodology to group large amounts of spam emails into spam campaigns on the basis of the categorical attributes of the spam messages. A new informative clustering algorithm, named Categorical Clustering Tree (CCTree), is introduced to cluster and characterize spam campaigns. The complexity of the algorithm is also analyzed and its efficiency is proved ([126]).
3.1
Introduction
Nowadays, the problem of receiving spam messages leaves no one untouched. According to a McAfee report [100], out of the 191.4 billion emails sent worldwide daily on average [110], more than 70% are spam. Microsoft and Google [113] estimate that spam emails cost American firms and consumers up to 20 billion dollars per year. Moreover, a Cisco report [136] shows that spam volume increased by 250 percent from January 2014 to November 2014. Spam emails cause problems ranging from direct financial losses to the misuse of traffic, storage space and computational power.
Given the relevance of the problem, several approaches have already been proposed to tackle this issue. Currently, the most used approach for fighting spam emails consists in identifying and blocking them on the recipient machine through filters [30], [46], [123], which are generally based on machine learning techniques or content features [22], [138], [139]. Alternative approaches are based on the analysis of spam botnets [79], [91], [146], [152].
Though some mechanisms to block spam emails already exist, spammers still impose non-negligible costs on users and companies [113]. Thus, the analysis of spammer behavior and the identification of spam-sending infrastructures are of capital importance in the effort of defining a definitive solution to the problem of spam emails.
Such an analysis, which is based on the structural dissection of raw emails, constitutes an extremely challenging task, due to the following factors:
— The amount of data to be analyzed is huge and grows fast every single hour.
— New attack strategies are constantly designed, and the immediate understanding of such strategies is paramount in fighting the criminal attacks carried out through spam emails (e.g. phishing).
To simplify this analysis, the huge amount of spam emails should be divided into spam campaigns. A spam campaign is the set of messages spread by a spammer with a specific purpose [27], like advertising a specific product, spreading ideas, or for criminal intents, e.g. phishing. Grouping spam messages into spam campaigns reveals behaviors that may be difficult to infer when looking at a large collection of spam emails as a whole [132]. According to [27], in order to characterize the strategies and traffic generated by different spammers, it is necessary to identify groups of messages that are generated following the same procedure and that are part of the same campaign.
It is noteworthy that the problem of grouping a large amount of spam emails into smaller groups is an unsupervised learning task, since no labeled data is available at the beginning for training a classifier. More specifically, supervised learning requires the classes to be defined in advance and the availability of a training set with elements for each class. In several classification problems this knowledge is not available, and unsupervised learning is used instead. Unsupervised learning refers to trying to find hidden structure in unlabeled data [57]. The best-known unsupervised learning methodology is clustering, which divides data into groups (clusters) of objects, such that the objects in the same group are more similar to each other than to those in other groups [77].
However, dividing spam messages into spam campaigns is not a trivial task, due to the following reasons:
— Spam campaign classes are not known beforehand, which means that we need an unsupervised machine learning technique.
— Feature extraction is difficult: finding the elements that best characterize an email is an open problem, addressed differently in various research works [50], [17], [150], [132].
For these reasons, the most used approach to classify spam emails is to cluster them on the basis of their similarities [4], [111], [132].
However, the accuracy of current solutions is still somewhat limited, and further improvements are needed. While some categorical attributes, for example the language of a spam message, are primary, discriminative and outstanding characteristics of a spam campaign, in previous works [87], [92], [4], [130], [131], [144], [28] these categorical features are either not considered, or the homogeneity of the resulting campaigns is not based on them.
In this chapter, after a thorough literature review on the clustering and classification of spam emails, we propose a preliminary work on the design of a categorical clustering algorithm for grouping spam emails, based on structural features of emails like language, number of links, email size, etc. The rationale behind this approach is that two messages in the same format, i.e. with a similar language and size, the same number of attachments, the same amount of links, etc., are more likely to originate from the same source, thus belonging to the same campaign. To this aim, we extract categorical features (attributes) from spam emails which are representative of their structure and which should clearly capture the differences between emails belonging to different campaigns.
The proposed clustering algorithm, named Categorical Clustering Tree (CCTree), builds a tree starting from the whole set of spam messages. At the beginning, the root node of the tree contains all data points, which constitutes a skewed dataset where unrelated data are mixed together. Then, the proposed clustering algorithm divides the data points step by step, clustering together data that are similar and obtaining homogeneous subsets of data points. The similarity of the clustered data points at each step of the algorithm is measured by an index called node purity. If the level of purity is not sufficient, the data points belonging to the node are not sufficiently homogeneous, and they should be divided into different subsets (nodes) based on the characteristic (attribute) that yields the highest value of entropy. The rationale behind this choice is that dividing the data on the basis of the attribute which yields the greatest entropy helps create more homogeneous subsets, in which the overall value of entropy is consistently reduced; this aims at reducing the time needed to obtain homogeneous subsets. This division of non-homogeneous sets of data points is repeated iteratively until all sets are sufficiently pure, or until the number of elements belonging to a node is less than a specific threshold set in advance. These pure sets are the leaves of the tree and represent the different spam campaigns.
The usage of categorical attributes is crucial for the proposed approach, which exploits the Shannon entropy [125], known to yield good results on nominal attributes. After detailing the CCTree algorithm and briefly presenting the categorical features used to describe spam emails, we discuss the algorithm's efficiency, proving its linear complexity.
The rest of this chapter is structured as follows. Section 3.2 provides some preliminary notions on the topic. Section 3.3 reports a literature review concerning the previous techniques used for clustering spam emails into campaigns. In Section 3.4, we describe the proposed categorical clustering algorithm for clustering spam messages. In Section 3.5, the analysis of the proposed methodology is discussed. Finally, Section 3.6 briefly concludes and sketches some future directions.
3.2
Preliminary Notions
In this section we briefly present some preliminary notions required for our proposed process of clustering spam emails into campaigns.
Clustering Let X be a dataset consisting of data points (or objects, instances, cases, patterns, tuples, transactions, elements) x_i = (x_{i1}, x_{i2}, . . . , x_{id}) in an attribute space A, i.e. each x_{ij} ∈ A, 1 ≤ i ≤ n, 1 ≤ j ≤ d, where n is the number of points belonging to X and d is the number of attributes. Furthermore, each x_{ij} is a numerical or categorical attribute value (or feature, value, component). Such a point-by-attribute data representation conceptually corresponds to a matrix. The ultimate goal of clustering [18] is to assign the points to a finite set of k subsets C_1, C_2, . . . , C_k, named clusters. Usually the subsets do not intersect (although this assumption is sometimes violated), and their union equals the full dataset, with the possible exception of outliers:

X = C_1 ∪ C_2 ∪ . . . ∪ C_k ∪ C_outlier,   C_i ∩ C_j = ∅ for i ≠ j
Clustering groups data points into subsets in such a manner that similar instances are grouped together, while different points belong to different groups [117]. Since clustering means grouping similar instances, some sort of measure that can determine whether two objects are similar or dissimilar is required. Many clustering techniques use distance measures to determine the similarity or dissimilarity between any pair of objects. The distance between two points x_i and x_j is usually denoted d(x_i, x_j). A valid distance measure should be symmetric and reach its minimum value (usually zero) for identical vectors. A distance measure is called a metric distance measure if it also satisfies the following properties:

d(x_i, x_k) ≤ d(x_i, x_j) + d(x_j, x_k)   ∀ x_i, x_j, x_k ∈ X
d(x_i, x_j) = 0 ⇔ x_i = x_j   ∀ x_i, x_j ∈ X
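As a concrete categorical example (ours, for illustration), the Hamming distance, i.e. the number of mismatching attribute values between two vectors, satisfies all of the above properties and is therefore a metric:

```python
# Hamming distance over categorical vectors: symmetric, zero exactly for
# identical vectors, and satisfying the triangle inequality.
def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

x = ("english", "html", "no-attachment")
y = ("english", "text", "no-attachment")
print(hamming(x, y))   # 1
print(hamming(x, x))   # 0
```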
Shannon Entropy In information theory, entropy is a measure of the uncertainty of a random variable. More specifically, the Shannon entropy [125], as a measure of uncertainty, of a random variable X with N outcomes {x_1, x_2, . . . , x_N} is defined as follows:

H(X) = − ∑_{i=1}^{N} p(x_i) log(p(x_i))

where p(x_i) = N_i / N, N_i is the number of occurrences of outcome x_i, and N is the total number of elements of X.
The Shannon entropy is maximal when all outcomes are equally likely, i.e. the number of elements for each value is almost the same, and it reaches its minimum, i.e. zero, when all the data belonging to a set are identical. Thus, the closer to zero, the purer the dataset.
To get better insight into how Shannon entropy captures the purity of a dataset, Figures 3.1 and 3.2 are provided. At first glance, it is clear that dataset 2 is purer, i.e. more homogeneous, than dataset 1. The following two computations show that Shannon entropy returns the minimum possible value, i.e. zero, for the completely pure dataset 2:

Figure 3.1 – dataset 1

Figure 3.1: H(dataset 1) = −(0.4 log(0.4) + 0.3 log(0.3) + 0.3 log(0.3)) = 0.4729
Figure 3.2: H(dataset 2) = −(10/10) log(10/10) = 0
[38] and [93] show that entropy works well as a distance measure in clustering algorithms.
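The two computations above can be checked with a few lines of code; note that the worked example uses base-10 logarithms, so we do the same here:

```python
import math
from collections import Counter

# Shannon entropy of a categorical sample, with p(x_i) = N_i / N.
def shannon_entropy(values):
    n = len(values)
    return -sum((c / n) * math.log10(c / n) for c in Counter(values).values())

dataset1 = ["a"] * 4 + ["b"] * 3 + ["c"] * 3   # proportions 0.4 / 0.3 / 0.3
dataset2 = ["a"] * 10                          # completely homogeneous
print(round(shannon_entropy(dataset1), 4))     # 0.4729
print(round(shannon_entropy(dataset2), 4))     # -0.0, i.e. zero
```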
Spam Campaign A spam campaign is the set of messages spread by a spammer with a specific purpose [27], like advertising a specific product, spreading ideas, or for criminal intents, e.g. phishing. The premise of our spam campaign detection, as in [27], is that spammers generally keep some parts of the message static, whilst other parts are changed systematically through automated text, image, or dynamic link generation.
To get better insight into how two spam emails may belong to the same campaign, we refer the reader to Figures 3.3 and 3.4. Although the text, images, and dynamic links of these two emails differ, it is obvious that both were generated from the same source, or designed by the same spammer. The rationale behind our spam campaign detection is to focus on the features that remain almost unchanged when a spam campaign is created, e.g. the language of the message, the number of images, etc.
Figure 3.2 – dataset 2
Figure 3.3 – Spam 1
Figure 3.4 – Spam 2
3.3
Related Works
To the best of our knowledge, just a few works exist related to the problem of clustering spam emails into campaigns.
In [87], the basic idea for identifying campaigns is the use of keywords standing for specific types of campaigns. In this study, campaigns are first found manually based on keywords, and then some interesting results are extracted from the groups of campaigns. Since it requires manual scanning of the spam, this approach is not suitable for large datasets. In [54], although the authors focus on the analysis of spam URLs on Facebook, the study of URLs and the clustering of spam messages are similar to our goal concerning spam emails. First, all wall posts that share the same URLs are clustered together; then the descriptions of the wall posts are analyzed, and if two wall posts have the same description, their clusters are merged. In [92], the authors observe that spam emails with identical URLs are highly clusterable and mostly sent in bursts. In their method, if the same URL exists in spam emails from source A and source B, each with a unique IP address, the two sources are connected with an edge, and the connected components are the desired clusters. Spamscatter [4] is a method that automatically clusters the destination websites extracted from URLs in spam emails with the use of image shingling. In image shingling, images are divided into blocks and the blocks are hashed; two images are considered similar if 70 percent of the hashed blocks are the same. In [150], the spam emails are clustered based on their images, to trace the origins of the spam; two emails are visually similar if their illustrations, text, layouts, and/or background textures are similar.
J. Song et al. [130] focus on clustering spam emails based on the IP addresses resolved from the URLs inside the bodies of these emails: two emails belong to the same cluster if their sets of resolved IP addresses are exactly the same. In all the previous works, a pairwise comparison of every two emails is required for finding the clusters. This kind of comparison has two problems: the time complexity is quadratic, which is not suitable for big data clustering; furthermore, the clusters are found on the basis of just one or two features of the messages, which decreases the precision. In the works that follow, spam emails are grouped with the use of clustering algorithms.
In [132], the same authors as [130] mention that considering only the IP addresses resolved from URLs is insufficient for clustering: since web servers host many websites on the same IP address, each IP cluster in [130] consists of a large amount of spam emails sent by different controlling entities. Thus, the authors propose a new method, called O-means clustering, based on the K-means clustering method. The distance is based on 12 features of the email body, which are expressed as numbers, and the Euclidean distance is used to measure the distance between two emails. In [131], after clustering spam emails according to the O-means method, the authors found that the 10 largest clusters had sent about 90 percent of all spam emails in their dataset. Hence, they investigated these 10 clusters with a heuristic analysis to select the significant features among the 12 used in the previous work; as a result, they selected the four most important features, which could effectively separate these 10 clusters from each other. Since the idea for clustering is based on K-means, a computationally NP-hard algorithm, the method is expensive; it also requires the number of clusters to be known from the beginning.
In [144], the authors focus on a set of eleven attributes extracted from the messages to cluster spam emails. Two clustering methods are used: an agglomerative hierarchical algorithm clusters the whole dataset, and then, for the clusters containing too many emails, the connected components with weighted edges algorithm is used to reduce the false positive rate. With the use of agglomerative clustering [66], a global clustering is done based on common features of the email attributes: in the beginning, each email is a cluster by itself, and clusters sharing common features are then merged. In this model, edges connect two nodes (spam emails) based on the eleven attributes, and the desired clusters are the connected components of the graph whose edge weights are above a specified threshold. This method suffers from not being applicable to large datasets, as the pairwise comparison requires quadratic time complexity. The basic hypothesis of the FP-Tree method [27] for clustering spam emails is that some parts of spam messages remain static, which makes a spam campaign recognizable; in this work, as an improvement over [92], URLs are not the only basis for clustering.
For identifying spam campaigns, a Frequent Pattern Tree (FP-Tree), as a signature-based method, is constructed from features extracted from the spam emails: language of email, message layout, type of message, URL and subject. In this tree, each node below the root represents a feature, extracted from the spam messages, that is shared by the subtrees beneath it; thus, each path in the tree shows a set of features that co-occur in messages, with the property of non-increasing frequency of occurrence. The problem of the FP-Tree is that it is based on the frequency of features, rather than on creating pure clusters in terms of homogeneity. Redundant features are also removed for specifying a campaign according to the frequency property, while in our method redundant features are characterized based on the purity, or homogeneity, of the campaigns. However, the greatest problem results from the sensitivity of the FP-Tree to dynamic URL and text generation in layout detection: the layout is extracted line by line, which means that two very similar emails with a one-line difference will be attributed two different layouts.
In summary, the previous works on clustering spam emails can be mainly divided into two categories: the first group focuses on the pairwise comparison of emails, for example URL comparison, while the second group consists of those in which a clustering algorithm is used, for example O-means clustering. In general, the aforementioned works suffer from at least one of the following problems: 1) they consider only one or two features for grouping spam messages, which decreases the accuracy; 2) pairwise comparison is used, with quadratic time complexity; 3) the number of clusters is required as prior knowledge; 4) the features which create a pure cluster are not singled out. Our proposed algorithm tries to solve these problems.
3.4
Categorical Clustering Tree (CCTree)
The general idea for construction comes from a supervised learning algorithm called Induction
Decision Tree (ID3) [109]. To create the CCTree, a set of objects is given in which each data
point is described in terms of a set of categorical attributes, e.g. the language of a message.
Each attribute represents the value of an important feature of data and is limited to assume a
set of discrete, mutually exclusive values, e.g. the Language as an attribute can take its values
or features as E nglish or F rench. Then, a tree is constructed in which the leaves of the tree
are the desired clusters, while other nodes contain non pure data needing an attribute-based
test to separate them. The separation is shown with a branch for each possible outcome of the
specific attribute values. Each branch or edge extracted from that parent node is labeled with
the selected value which directs data to the child node. The attribute for which the Shannon
entropy is maximum is selected to divide the data based on it. A purity function on a node,
based on Shannon entropy, is defined. Purity function represents how much the data belonging
to a node are homogeneous. A required threshold of node purity is specified. When a node
purity is equal or greater than this threshold, or the number of elements in a node is less than
a threshold, the node is labeled as a leaf or terminal node.
The precise process of CCTree construction can be formalized as follows:
— Input : Let D be a set of data points, containing N tuples over a set A of d attributes, and let S be a set of stop conditions.
Attributes An ordered set of d attributes A = {A_1, A_2, . . . , A_d} is given, where each attribute is an ordered set of mutually exclusive values. Thus, the j-th attribute can be written as A_j = {v_{1j}, v_{2j}, . . . , v_{r_j j}}, where r_j is the number of values of attribute A_j. For example, A_i could be the Language of a spam email, with the set of possible values {English, French, Spanish}.
Data Points A set D of N data points is given, where each data point is a vector whose elements are the values of the attributes, e.g. D_i = (v_{i_1 1}, v_{i_2 2}, . . . , v_{i_d d}), where v_{i_k k} ∈ A_k is the i_k-th value of the k-th attribute. For example: spam 1 = (English, excel attachment, image based).
Stop Conditions A set of stop conditions S = {µ, ε} is given. µ is the minimum number of elements in a node, i.e. when the number of elements in a node is less than µ, the node is not divided even if it is not pure enough. ε represents the minimum desired purity of each cluster, i.e. when the purity of a node is at least as good as ε, the node is considered a leaf.
To calculate the node purity, a function based on Shannon entropy is defined as follows. Let N_{kj}^i be the number of elements having the k-th value of the j-th attribute in node i, and let N_i be the number of elements in node i. Thus, considering p(v_{kj}^i) = N_{kj}^i / N_i, the purity of node i, denoted ρ(i), is defined as:

ρ(i) = − ∑_{j=1}^{d} ∑_{k=1}^{r_j} p(v_{kj}^i) log(p(v_{kj}^i))

where d is the number of attributes and r_j is the number of values of the j-th attribute.
— Output : A set of clusters which are the leaves of the categorical clustering tree.
Figure 3.5 – A Small CCTree: the root S is split on the values red and blue into the nodes Sr and Sb, and Sb is further split on the values small and large into the leaves Sb.s and Sb.l.
We report in the following the process of creating the CCTree. At the beginning, all data points, as a set of N tuples, are assigned to the root of the tree; the root is the first new node. The clustering process is applied iteratively to each newly created node. For each new node of the tree, the algorithm checks whether the stop conditions are verified, i.e. whether the number of data points is less than the threshold µ, or the purity is less than or equal to ε. In this case the node is labeled as a leaf; otherwise, the node is split.
In order to find the best attribute for dividing the cluster, the Shannon entropy of the distribution of each attribute's values is calculated, and the attribute for which the Shannon entropy is maximal is selected. The reason is that the attribute with the most equiprobable distribution of values generates the highest amount of chaos (non-homogeneity) in a node. For each possible value of the selected attribute, a branch labeled with that value is extracted from the node, directing the data carrying that value to the corresponding child node. The process is then iterated until each node is either split or labeled as a leaf. At the last step, the final nodes, or leaves, of the tree are the set of desired clusters, named {C_1, C_2, . . . , C_k}.
Figure 3.5 depicts an example of a small CCTree, whilst a formal description of the algorithm is given in Algorithm 1.
The source codes are provided in A.1.
Algorithme 1 : Categorical Clustering Tree (CCTree) algorithm
Input : Data points D_k, attributes A_l, attribute values V_m, node_purity_threshold, min_num_elem
Output : Clusters C_k
Root node N_0 takes all data points D_k
for each node N_i that is not a leaf node do
    if node_purity_i ≤ node_purity_threshold or num_elem_i < min_num_elem then
        label N_i as leaf;
    else
        select the attribute A_j yielding the maximum Shannon entropy on the data of N_i;
        use A_j to divide the data of N_i;
        generate new nodes N_{i1}, . . . , N_{it}, with t the number of values of A_j;
    end
end
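As a companion to Algorithm 1, the following minimal sketch (our own code, not the implementation referenced in A.1) builds the clusters recursively; the data points are tuples of categorical values, and all function and variable names are ours:

```python
import math
from collections import Counter

# Entropy-based purity rho(i): sum of the Shannon entropies of all
# attribute-value distributions in the node (zero = perfectly pure).
def purity(points, num_attributes):
    n = len(points)
    rho = 0.0
    for j in range(num_attributes):
        for c in Counter(p[j] for p in points).values():
            rho -= (c / n) * math.log(c / n)
    return rho

# Attribute whose value distribution has the maximal Shannon entropy.
def best_attribute(points, num_attributes):
    def entropy(j):
        n = len(points)
        counts = Counter(p[j] for p in points)
        return -sum((c / n) * math.log(c / n) for c in counts.values())
    return max(range(num_attributes), key=entropy)

# Recursive CCTree construction; returns the leaves (clusters).
def cctree(points, num_attributes, eps, mu):
    if len(points) < mu or purity(points, num_attributes) <= eps:
        return [points]                        # stop condition: node is a leaf
    j = best_attribute(points, num_attributes)
    children = {}
    for p in points:                           # one child node per value of A_j
        children.setdefault(p[j], []).append(p)
    clusters = []
    for child in children.values():
        clusters.extend(cctree(child, num_attributes, eps, mu))
    return clusters

# Toy emails described by (language, attachment, format).
emails = [("en", "yes", "html"), ("en", "yes", "html"),
          ("fr", "no", "text"), ("fr", "no", "text"), ("en", "no", "text")]
for cluster in cctree(emails, 3, eps=0.1, mu=1):
    print(cluster)
```

Splitting on the maximum-entropy attribute guarantees progress: whenever the purity exceeds ε, some attribute has more than one value, so every child node is strictly smaller than its parent.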
3.5
Time Complexity
The proposed structure-based methodology for clustering spam emails into campaigns, respecting the aforementioned requirements of our problem, is linear in time complexity. This property becomes even more impressive when compared with the complexity of previous works for grouping spam emails into campaigns, which are mostly based on the pairwise comparison of spam messages and therefore suffer from the quadratic time complexity resulting from this kind of comparison.
Here we briefly discuss the precise time complexity of the proposed methodology. Let n be the size of the whole dataset, n_i the number of elements in node i, m the total number of features, v_l the number of values of attribute A_l, d the number of attributes, and v_max = max{v_l : l = 1, 2, . . . , d}.
For constructing a CCTree, an n_i × m matrix must be created from the data belonging to each non-leaf node i, which takes O(m × n_i) time. Finding the appropriate attribute for dividing the data requires constant time. To divide the n_i points based on the v_l values of the selected attribute A_l, O(n_i × v_l) time is needed. This process is repeated at each non-leaf node. Thus, if K = m + 1 is the maximum number of non-leaf nodes, which arises in a complete tree, then the maximum time required for constructing a CCTree with n elements equals O(K × (n × m + n × v_max)).
Recalling that the number of features m, and consequently K = m + 1, are constants, we conclude that the construction is linear in the number of data points.
3.6
Conclusion
Spam emails impose a non-negligible cost, damaging users and companies for several millions of dollars each year. To fight spammers effectively, catch them, or analyze their behavior, it is not sufficient to stop spam messages from being delivered to the final recipient. Characterizing the spam campaigns sent by a specific spammer, instead, is necessary to analyze the spammer's behavior. Such an analysis can be used to tailor a more specific prevention strategy, which could be more effective in tackling the issue of spam emails. Considering a large set of spam emails as a whole makes the definition of spam campaigns an extremely challenging task; thus, we argue that a clustering algorithm is required to group this huge amount of data based on message similarities.
In this chapter we proposed a new categorical clustering algorithm, named CCTree, which we argue to be useful for the problem of clustering spam emails. This algorithm, in fact, allows an easy analysis of the data based on an informative structure. The CCTree algorithm introduces an easy-to-understand representation, where it is possible to infer at first glance the criteria used to group spam emails into clusters. This information can be used, for example, by officers to track and persecute a specific subset of spam emails which may be related to an important crime. Here we have mainly presented the theoretical results of our approach; the implementation of the CCTree algorithm and its usage in clustering spam emails are presented in the following chapter.
Chapitre 4
Effectiveness and Efficiency of CCTree
in Spam Campaign Detection
Spam emails yearly impose extremely heavy costs in terms of time, storage space and money on both private users and companies. Finding and persecuting the spammers and the possible spam email stakeholders should allow the root of the problem to be tackled directly. To facilitate such a difficult analysis, which should be performed on large amounts of unclassified raw emails, in this chapter we propose a framework to quickly and effectively divide large amounts of spam emails into homogeneous campaigns through structural similarity. The framework exploits a set of 21 features representative of the email structure and the novel categorical clustering algorithm Categorical Clustering Tree (CCTree). The methodology is evaluated and validated through standard tests performed on three datasets accounting for more than 200k real recent spam emails ([129]).
4.1
Introduction
Spam emails constitute a notorious and persistent problem, still far from being solved. In the last year, out of the 191.4 billion emails sent worldwide daily on average, more than 70% were spam [110]. Spam emails cause several problems, spanning from direct financial losses to the misuse of Internet traffic, storage space and computational power [113]. Moreover, spam emails are becoming a tool to perpetrate different cybercrimes, such as phishing, malware distribution, or social engineering-based frauds.
Given the relevance of the problem, several approaches have already been proposed to tackle the spam email issue. Currently, the most used approach for fighting spam emails consists in identifying and blocking them on the recipient machine through filters, which are generally based on machine learning techniques or content features, such as keywords or non-ASCII characters [30], [46], [123], [22]. Unfortunately, these countermeasures only slightly mitigate the problem, which still imposes non-negligible costs on users and companies [113].
To effectively fight the problem of spam emails, it is mandatory to find and persecute the spammers, who generally hide behind complex networks of infected devices which send spam emails against their users' will, i.e. botnets. Thus, information useful for finding the spammer should be inferred by analyzing the text, the attachments and other elements of the emails, such as links; therefore, the early analysis of correlated spam emails is vital [44], [4]. However, such an analysis constitutes an extremely challenging task, due to the huge amount of spam emails, which increases vastly every hour (8 billion per hour) [110], and to the high variance that related emails may show as a result of obfuscation techniques [108]. To simplify this analysis, the huge amounts of spam emails, generally collected through honey-pots, should be divided into spam campaigns [132]. A spam campaign is the set of messages spread by a spammer with a specific purpose [27], like advertising a product, spreading ideas, or for criminal intents.
In this chapter, we propose to use the algorithm presented in Chapter 3, on a set of 21 attributes, to quickly and effectively group large amounts of spam emails by structural similarity. A set of 21 discriminative structural features is considered to obtain homogeneous email groups, which identify different spam campaigns. Grouping spam emails on the basis of their similarities is a known approach; however, previous works mainly focus on the analysis of a few specific parameters [4], [111], [132], [139], showing results whose accuracy is still somewhat limited. The approach is based on applying the CCTree, a tree-like structure whose leaves represent the various spam campaigns. The algorithm clusters (groups) emails through structural similarity, verifying at each step the homogeneity of the obtained clusters and dividing the groups that are not homogeneous (pure) enough on the basis of the attribute which yields the greatest variance (entropy). The effectiveness of the proposed approach has been tested against 10k spam emails extracted from a real recent dataset (http://untroubled.org/spam) and compared with other well-known categorical clustering algorithms, reporting the best results in terms of clustering quality (i.e. purity and accuracy) and time performance.
The contributions of the present chapter can be summarized as follows:
— We introduce a set of 21 categorical features representative of the email structure, briefly discussing the discretization procedure for the numerical features, which are used for applying CCTree.
— The performance of CCTree has been thoroughly evaluated through internal evaluation, to estimate the ability to obtain homogeneous clusters, and external evaluation, to assess the ability to effectively classify similar elements (emails) when the classes are known beforehand. Internal and external evaluation have been performed, respectively, on a dataset of 10k unclassified spam emails and on 276 emails manually divided into classes.
— We propose, and validate through an analysis of 200k spam emails, a methodology to choose the optimal CCTree configuration parameters, based on the detection of the maximum curvature point (knee) on a homogeneity vs. number of clusters graph.
— We compare the proposed methodology with two general categorical clustering algorithms, and with other methodologies specific to clustering spam emails.
The rest of this chapter is structured as follows. Section 4.2 describes the proposed framework, detailing the extracted features and reporting implementation details. Section 4.3 reports the experiments performed to evaluate the ability of CCTree to cluster spam emails, comparing the results with those of two well-known categorical clustering algorithms; the methodology to set the CCTree parameters is also reported and validated. Section 4.4 discusses the limitations and advantages of the proposed approach, reporting a comparison of the results with some related work. Other related work on clustering spam emails is presented in Section 4.5. Finally, Section 4.6 briefly concludes, proposing future directions.
4.2
Framework
The presented framework acts in two steps. First, raw emails are analyzed by a parser to extract vectors of structural features. Afterwards, the collected vectors (elements) are clustered through the introduced CCTree algorithm. This section reports the details of the proposed framework for analyzing and clustering spam emails, and of the extracted features.
4.2.1
Feature Extraction and Definition
To describe spam emails, we have selected a set of 21 categorical attributes which are representative of the structural properties of emails. The reason is that the general appearance of messages belonging to the same spam campaign mainly remains unchanged, although spammers usually insert random text or links [27]. The selected attributes extend the set of structural features proposed in [99] for labeling emails as spam or ham. The attributes, with a brief description of each, are presented in Table 4.1.
Since the clustering algorithm is categorical, all the selected features are categorical as well. It is worth noting that some features naturally take numerical values, e.g. AttachmentSize, rather than categorical ones. However, it is always possible to turn these features from numerical into categorical, by defining intervals and assigning a feature value to each interval so defined. We chose these intervals on the basis of the ChiMerge discretization method [85], which returns outstanding results for discretization in decision tree-like problems [56]. The details of the discretization results are provided in Tables A.1 to A.18.
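Interval-based discretization itself is straightforward once the cut points are fixed; the sketch below illustrates the idea with made-up boundaries and labels (the actual ChiMerge cut points are those reported in Tables A.1 to A.18):

```python
import bisect

# Hypothetical cut points (bytes) and labels for the EmailSize feature;
# ChiMerge would choose the boundaries from the data, not by hand.
CUTS = [1024, 10240, 102400]
LABELS = ["tiny", "small", "medium", "large"]

def discretize_email_size(size_bytes: int) -> str:
    return LABELS[bisect.bisect_right(CUTS, size_bytes)]

print(discretize_email_size(5000))    # -> "small"
```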
Features of particular interest are those reporting the number of pictures in the email (ImagesNumber), the presence of HTML tags (IsHtml), and the number of links (NumberOfLinks). Through these features, in fact, it is possible to determine whether the email is raw text, contains several images, or is presented in the form of a web page, characteristics which mostly remain unchanged when a spammer designs a spam campaign to be sent in burst.

Attribute – Description
RecipientNumber – Number of recipient addresses.
NumberOfLinks – Total links in the email text.
NumberOfIPBasedLinks – Links shown as an IP address.
NumberOfMismatchingLinks – Links whose text differs from the real link.
NumberOfDomainsInLinks – Number of domains in links.
AvgDotsPerLink – Average number of dots per link in the text.
NumberOfLinksWithAt – Number of links containing “@”.
NumberOfLinksWithHex – Number of links containing hex chars.
SubjectLanguage – Language of the subject.
NumberOfNonAsciiLinks – Number of links with non-ASCII chars.
IsHtml – True if the mail contains HTML tags.
EmailSize – The email size, including attachments.
Language – Email language.
AttachmentNumber – Number of attachments.
AttachmentSize – Total size of the email attachments.
AttachmentType – File type of the biggest attachment.
WordsInSubject – Number of words in the subject.
CharsInSubject – Number of chars in the subject.
ReOrFwdInSubject – True if the subject contains “Re” or “Fwd”.
NonAsciiCharsInSubject – Number of non-ASCII chars in the subject.
ImagesNumber – Number of images in the email text.

Table 4.1 – Features extracted from each email.
4.2.2
Implementation Details
On the implementation side, an email parser has been developed in Java to automatically analyze the raw email text and extract the features in the form of vectors. The software exploits JSoup [69] for HTML parsing and the LID Python tool (http://www.cavar.me/damir/LID/) for language recognition. The LID software exploits the technique of n-grams to recognize the language of a text: for each language that LID has to recognize, a database of words must be provided to the software, from which the n-grams are extracted. The languages on which LID has been trained are the following: English, Italian, French, German, Spanish, Portuguese, Chinese, Japanese, Persian, Arabic, Croatian. We have implemented the CCTree algorithm in MATLAB (http://mathworks.com), which takes as input the matrix of email features extracted by the parser.
It is worth noting that the complete framework, i.e. the feature extraction and clustering modules, is totally portable across different operating systems. In fact, both the feature extraction module
and the clustering module (i.e. MATLAB) are Java-based and executable on the vast majority of general-purpose operating systems (UNIX, iOS, etc.). The Python module for language analysis is portable as well. Moreover, LID has been made a disposable component, i.e. if the Python interpreter is missing, the analysis is not stopped: for the emails whose language is not inferable, the value UNKNOWN_LANGUAGE is used for the attribute instead.
4.3
Evaluation and Results
This section reports on the experimental results that evaluate the quality of the CCTree algorithm on the problem of clustering spam emails. A first set of experiments has been performed on a dataset of 10k recent spam emails (February 2015), to estimate the capability of the CCTree algorithm of obtaining homogeneous clusters. This evaluation is known as Internal Evaluation and estimates the quality of the clustering algorithm, measuring how much each element of a resulting cluster is similar to the elements of the same cluster and dissimilar from the elements of other clusters. A second set of experiments aims at assessing the capability of CCTree to correctly classify data, using a small dataset whose benchmark classes are known beforehand. This evaluation is named External Evaluation and measures the similarity between the clusters returned by a specific algorithm and the desired clusters (classes) of the pre-classified dataset. For external evaluation, CCTree has been tested against a dataset of 276 emails, manually labeled in 29 classes 4. The emails have been manually divided, looking both at the structure and at the semantics of the message. Thus, emails belonging to one class can be considered as part of a single spam campaign.

4. Available at: http://security.iit.cnr.it/images/Mails/cctreesamples.zip
The results of CCTree are compared with those of two categorical clustering algorithms, namely COBWEB and CLOPE, well known for being an accurate and a fast clustering algorithm, respectively. The comparison has been done both for internal and external evaluation, on the same aforementioned datasets. A time performance analysis is also reported. It is worth noting that the three algorithms are all implemented on Java-based tools, hence the validity of the time comparison.

In what follows, we briefly introduce these two algorithms:
COBWEB: COBWEB, proposed by [51], is a categorical clustering algorithm which builds a dendrogram where each node is associated with a conditional probability summarizing the attribute-value distributions of the objects belonging to that node. Differently from the CCTree algorithm, it also includes a merging operation to join two separate nodes into a single one. COBWEB is computationally demanding and time consuming, since it re-analyzes every single data point at each step. COBWEB employs the following four operations:
• Merging two nodes: the two nodes are replaced by a node whose children are the children of the original nodes, and which summarizes the attribute-value distribution of the elements classified under them.
• Splitting a node: a node is split by replacing it with its own children.
• Inserting a new node: a new node is created for a new data point inserted into the tree.
• Passing a datum through the tree: the datum is placed in the node it fits best.
However, the COBWEB algorithm is used in several fields for its good accuracy, to the extent that its similarity measure, named Category Utility, is used to evaluate categorical clustering accuracy [7]. It is formally defined as follows.
Definition 4.1. Category Utility (CU): The category utility [60] is defined as the difference between the expected number of attribute values that can be guessed correctly given the clustering, and the expected number of correct guesses without this knowledge. Let {C_1, ..., C_k} be the set of clusters, and let v_ij (for all possible j) be the values of attribute A_i. Then CU is defined as follows:

CU = \sum_{C_i} \frac{|C_i|}{k} \sum_{i} \sum_{j} \left[ P(A_i = v_{ij} \mid C_i)^2 - P(A_i = v_{ij})^2 \right]
The WEKA [65] implementation of COBWEB has been used for our experiments.
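As a minimal illustration of Definition 4.1, the sketch below computes the Category Utility of a clustering of categorical records, transcribing the formula above. The data layout (each record a tuple of attribute values) is an assumption made for the example.

```python
# Sketch: Category Utility (Definition 4.1) over categorical records.
from collections import Counter

def category_utility(clusters):
    """clusters: list of clusters; each cluster is a list of records,
    each record a tuple of categorical attribute values."""
    k = len(clusters)
    data = [rec for cluster in clusters for rec in cluster]
    n_attrs = len(data[0])

    def squared_mass(records):
        # Sum over attributes i and values v of P(A_i = v)^2 within `records`.
        total = 0.0
        for i in range(n_attrs):
            counts = Counter(rec[i] for rec in records)
            total += sum((c / len(records)) ** 2 for c in counts.values())
        return total

    base = squared_mass(data)  # guessing ability without the clustering
    return sum((len(c) / k) * (squared_mass(c) - base) for c in clusters)
```

Higher values indicate that attribute values are easier to guess inside the clusters than on the whole dataset.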
CLOPE: CLOPE [148] is a fast categorical clustering algorithm which maximizes the number of elements sharing the same value for a subset of attributes, attempting to increase the homogeneity of each obtained cluster. In this algorithm, a global criterion function is proposed to increase the intra-cluster overlapping by increasing the height-to-width ratio of the cluster histogram. The clustering with the maximum height-to-width ratio over all cluster histograms is the optimal result. Formally, the CLOPE clustering is defined as follows.

Let X = {x_1, x_2, ..., x_n} be a set of n tuples, where all the features of each data point x_i, 1 ≤ i ≤ n, are categorical. Suppose C = {C_1, C_2, ..., C_k} represents a division of X into k clusters, and D(C_i) denotes the statistical histogram of C_i with respect to the categorical attributes. Two measure functions are introduced in this method:

S(C_i) = \sum_{x_j \in C_i} |x_j|

where |x_j| is the dimensionality of x_j, and

W(C_i) = |H_i|

where |H_i| is the number of bins in histogram H_i. Then, the criterion function of CLOPE is defined as:

\max \left\{ \mathrm{Profit}(C) = \frac{1}{n} \sum_{i=1}^{k} \frac{S(C_i)}{W(C_i)^2} \, |C_i| \right\}

where |C_i| is the number of elements in cluster C_i. Also for CLOPE, the WEKA [65] implementation has been used for the performed experiments.
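The criterion function above translates directly into code. In the sketch below, assumed for illustration, each tuple is flattened into (attribute index, value) items so that the histogram bins of different attributes do not collide, and the exponent on W is fixed to 2 as in the formula (CLOPE's general repulsion parameter is set to 2 in the text).

```python
# Sketch: CLOPE profit criterion for a candidate partition of categorical tuples.
def clope_profit(clusters):
    """clusters: list of clusters, each a list of equal-length tuples."""
    n = sum(len(cluster) for cluster in clusters)
    profit = 0.0
    for cluster in clusters:
        # (attribute index, value) pairs: the items of the cluster histogram.
        items = [(i, v) for rec in cluster for i, v in enumerate(rec)]
        s = len(items)        # S(C_i): size (area) of the histogram
        w = len(set(items))   # W(C_i): number of distinct bins (width)
        profit += (s / w ** 2) * len(cluster)
    return profit / n
```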
4.3.1 Internal Evaluation
When the result of a clustering algorithm is evaluated based on the data that was clustered itself, the evaluation is called internal. Internal evaluation measures the ability of a clustering algorithm to obtain homogeneous clusters. A high score on internal evaluation is given to clustering algorithms that maximize the intra-cluster similarity, i.e. elements within the same cluster are similar, and minimize the inter-cluster similarity, i.e. elements from different clusters are dissimilar. Cluster dissimilarity is measured by computing the distances between elements (data points) in different clusters. The distance function used depends on the specific problem. In particular, for elements described by categorical attributes, the common geometric distances, e.g. the Euclidean distance, cannot be used. Hence, in this work the Hamming and Jaccard distance measures [66] are applied. The Hamming distance considers two elements closer when they share the same value for a higher number of attributes. The Jaccard index, on the other hand, is the size of the intersection of the attribute values of two elements divided by the size of their union; the Jaccard distance is its complement.
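For concreteness, the two distances can be sketched as follows for records encoded as tuples of categorical values; the normalization of the Hamming distance by the number of attributes is one common convention, assumed here.

```python
# Sketch: Hamming and Jaccard distances for categorical records (tuples).
def hamming(x, y):
    """Fraction of attributes on which the two records disagree."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def jaccard(x, y):
    """Complement of |intersection| / |union| of the attribute-value sets.
    Values are paired with their attribute index so that equal values of
    different attributes are not confused."""
    sx, sy = set(enumerate(x)), set(enumerate(y))
    return 1.0 - len(sx & sy) / len(sx | sy)
```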
Internal evaluation can be performed directly on the dataset on which the clustering algorithm operates, i.e. knowledge of the classes (desired clusters) is not a prerequisite. The indexes used for internal evaluation are the Dunn index [19] and the Silhouette [118], which are defined as follows:
Dunn index: Let ∆_i be the diameter of cluster C_i, defined as the maximum distance between elements of C_i:

\Delta_i = \max_{x, y \in C_i,\, x \neq y} \{ d(x, y) \}

where d(x, y) measures the distance between the pair x and y, for any distance specified by the user, e.g. the Hamming distance, and |C| denotes the number of elements belonging to cluster C. Also, let δ(C_i, C_j) be the inter-cluster distance between clusters C_i and C_j, computed from the pairwise distances between the elements of the two clusters. Then, on a set of k clusters, the Dunn index [64] is defined as:

DI_k = \min_{1 \le i \le k} \left\{ \min_{1 \le j \le k,\, j \neq i} \left\{ \frac{\delta(C_i, C_j)}{\max_{1 \le t \le k} \Delta_t} \right\} \right\}

A higher Dunn index value means a better cluster quality. It is worth noting that the value of the Dunn index is negatively affected by the greatest diameter among all the generated clusters (max_{1≤t≤k} ∆_t). Hence, even a single resulting cluster of poor quality (non-homogeneous) will cause a low value of the Dunn index. On the other hand, high values of this index mean that the overall homogeneity of the resulting clusters is remarkable.
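The Dunn index can be sketched in a few lines; here the inter-cluster distance δ is instantiated as the minimum pairwise distance between elements of the two clusters, one common reading of the pairwise computation mentioned above.

```python
# Sketch: Dunn index for a list of clusters and a pluggable distance function.
from itertools import combinations

def dunn_index(clusters, dist):
    # Diameter of each cluster: largest intra-cluster pairwise distance.
    diameters = [max((dist(x, y) for x, y in combinations(c, 2)), default=0.0)
                 for c in clusters]
    # delta(C_i, C_j): smallest distance between elements of the two clusters.
    deltas = [min(dist(x, y) for x in ci for y in cj)
              for ci, cj in combinations(clusters, 2)]
    return min(deltas) / max(diameters)
```

For instance, dunn_index(clusters, hamming) evaluates a partition under the Hamming distance sketched earlier.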
Silhouette: The dissimilarity of a point x_i from a cluster C is the average distance from x_i to the points of C; for categorical attributes, the distance can be taken as the Hamming distance. Let d(x_i) be the average dissimilarity of data point x_i from the other data points within the same cluster. Also, let d′(x_i) be the lowest average dissimilarity of x_i to any other cluster, except the cluster x_i belongs to. Then, the silhouette [118] s(i) of x_i is defined as:

s(i) = \frac{d'(i) - d(i)}{\max\{d(i), d'(i)\}} =
\begin{cases}
1 - d(i)/d'(i) & \text{if } d(i) < d'(i) \\
0 & \text{if } d(i) = d'(i) \\
d'(i)/d(i) - 1 & \text{if } d(i) > d'(i)
\end{cases}

This definition yields s(i) ∈ [−1, 1]. The closer s(i) is to 1, the more appropriately the data point x_i is clustered. The average value of s(i) over all the data of a cluster shows how tightly related the data within the cluster are. Hence, the closer the average value of s(i) is to 1, the better the clustering result. For easy interpretation, the silhouette of all clustered points is also represented through a silhouette plot.
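A direct transcription of this definition is sketched below; it returns the average of s(i) over all points, which is the quantity reported in the evaluation tables of this section.

```python
# Sketch: average silhouette over all the points of a clustering.
def average_silhouette(clusters, dist):
    scores = []
    for idx, cluster in enumerate(clusters):
        for x in cluster:
            # d(x): mean distance to the other points of x's own cluster.
            others = [y for y in cluster if y is not x]
            d = sum(dist(x, y) for y in others) / len(others) if others else 0.0
            # d'(x): lowest mean distance to the points of another cluster.
            d_prime = min(
                sum(dist(x, y) for y in other) / len(other)
                for j, other in enumerate(clusters) if j != idx)
            top = max(d, d_prime)
            scores.append((d_prime - d) / top if top > 0 else 0.0)
    return sum(scores) / len(scores)
```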
Performance Comparison
As discussed in Chapter 3, the CCTree algorithm requires two stop conditions as input, i.e. the minimum number of elements in a node to be split (µ) and the minimum purity in a cluster (ε). Henceforth, the notation CCTree(ε, µ) will be used to refer to the specific instantiation of the CCTree algorithm. To choose the stop conditions, we first fix the minimum number of elements to µ = 1, and then vary the node purity to see how the internal indexes are affected. It is worth noticing that when µ is fixed to 1, the only stop condition affecting the result is ε. Table 4.2 shows the results of the internal evaluations when µ is fixed to 1 and ε takes the five values 0.0001, 0.001, 0.01, 0.1 and 0.5.
Table 4.2 – CCTree internal evaluation with fixed number of elements (µ = 1).

Index                  ε = 0.0001   ε = 0.001   ε = 0.01   ε = 0.1   ε = 0.5
Silhouette (Hamming)   0.9772       0.9772      0.9642     0.7124    0.1040
Silhouette (Jaccard)   0.9777       0.9777      0.9650     0.7110    0.0820
Dunn (Hamming)         0.5          0.5         0.2        0.1111    0.0714
Dunn (Jaccard)         0.25         0.25        0.1571     0.1032    0.0704
For further insight, we report in Figures 4.1, 4.2, 4.3 and 4.4 the Hamming-distance silhouette plots for the CCTrees with the same parameters as in Table 4.2. The graphs are horizontal histograms in which every bar represents the silhouette result, s(i) ∈ [−1, 1], of each data point x_i, following the aforementioned definition. It can be seen that both CCTree(0.001,1) (Fig. 4.1) and CCTree(0.01,1) (Fig. 4.2) show no negative values for any data point, hence the high silhouette values, close to 1. Indeed, the first row of Table 4.2 reports the average silhouette over all the points on which the CCTree, with the specified input parameters, is constructed. The white spaces in the plots show the points for which the silhouette equals one.

Figure 4.1 – CCTree(0.001,1)
Figure 4.2 – CCTree(0.01,1)
Figure 4.3 – CCTree(0.1,1)
Figure 4.4 – CCTree(0.5,1)
Figure 4.5 graphs the internal evaluation measurements of CCTree for five different values of ε, when the minimum number of elements µ is set to 1. It is worth noting that if µ = 1, the only stop condition affecting the result is the node purity. This is the reason why we first fix µ = 1, to find the best value of the required node purity for our dataset.

Figure 4.5 – Internal evaluation at the variation of the ε parameter.

As shown in Figure 4.5, the indexes reach their maximum and stabilize when ε = 0.001. Stricter purity requirements (i.e., ε < 0.001) do not further increase the precision. This value of ε will be fixed for the following evaluations. More precisely, we first fix one of the input parameters so that it does not affect the result, i.e. µ = 1, and assign different values to the other parameter; at a certain point, a stricter parameter no longer affects the overall homogeneity of the result.
Fixing the node purity to ε = 0.001, we then look for the best value of the µ parameter, to be able to compare the CCTree performance with the accurate COBWEB and the fast CLOPE. To this end, we test four different values of the minimum number of elements in a cluster. Table 4.3 presents the Silhouette and Dunn index results for the proposed values of µ, namely 1, 10, 100 and 1000. In addition, the last two rows of Table 4.3 report the resulting number of clusters and the time required to generate the clusters.
Table 4.3 – Internal evaluation results of CCTree, COBWEB and CLOPE.

                       COBWEB    CCTree (ε = 0.001)                       CLOPE
                                 µ = 1    µ = 10   µ = 100   µ = 1000
Silhouette (Hamming)   0.9922    0.9772   0.9264   0.7934    0.5931       0.2801
Silhouette (Jaccard)   0.9922    0.9777   0.9290   0.8021    0.6074       0.2791
Dunn (Hamming)         0.1429    0.5      0.1      0.0769    0.0769       0
Dunn (Jaccard)         0.1327    0.25     0.1      0.0879    0.0857       0
Clusters               1118      619      392      154       59           55
Time (s)               17.81     0.6027   0.3887   0.1760    0.08610      3.02
Table 4.3 also reports the comparison with the two categorical clustering algorithms COBWEB and CLOPE, previously described. The first two columns from the left show comparable results in terms of clustering precision for the silhouette index. In fact, COBWEB and CCTree both have a good precision, when the CCTree purity is set to ε = 0.001 and the minimum number of elements is set to µ = 1 (CCTree(0.001,1)). COBWEB performs slightly better on the silhouette index, for both distance measures. However, the difference (less than 2 percent) is negligible, considering that COBWEB creates almost twice as many clusters as CCTree(0.001,1).

It can be inferred that a higher number of small clusters improves the internal homogeneity (e.g., a cluster with one element is totally homogeneous). However, as will be detailed in the following, a number of clusters strongly greater than the expected number of groups is not desirable. In fact, it follows from the Silhouette definition that, in case every element x_i is unique, the maximum theoretical value is achieved when each cluster contains only one element.

Moreover, CCTree(0.001,1) returns a better result for the Dunn index with respect to COBWEB. We recall that the value of the Dunn index is strongly affected by the homogeneity of the worst resulting cluster. The value returned for CCTree(0.001,1) shows that all the returned clusters globally have a good homogeneity compared to COBWEB, i.e. the worst cluster of CCTree(0.001,1) is much more homogeneous than the worst cluster of COBWEB.

The rightmost column of Table 4.3 reports the results for the CLOPE clustering algorithm. CLOPE is a categorical clustering algorithm known to be fast in creating clusters that are as pure as possible. The accuracy of CLOPE is quite limited for the Silhouette, and zero for the Dunn index, whilst CCTree(0.001,1000), with almost the same number of clusters, is 35 times faster than CLOPE.
A graphical description of the accuracy difference between the clusterings of Table 4.3 can be inferred from the Hamming silhouette plots of Figures 4.6, 4.7, 4.8, 4.9, 4.10, and 4.11. As before, the plots are horizontal histograms in which every bar represents the silhouette result, s(i) ∈ [−1, 1], of each data point x_i.

Figure 4.6 – COBWEB

Both COBWEB and CCTree(0.001,1) show no negative values, with the majority of data points scoring s(i) = 1. For CCTree(0.001,1000) the worst data points do not score less than −0.5, whilst for CLOPE some data points have a silhouette of −0.8, which causes a strong non-homogeneity in their clusters. Moreover, the number of data points with positive values is much higher for CCTree(0.001,1000) than for CLOPE, even if CLOPE returns some points whose silhouette value is 1. This also justifies the better value of the Dunn index for CCTree(0.001,1000), which we recall is affected by the non-homogeneity of the worst cluster; overall, CCTree(0.001,1000) returns a better silhouette than CLOPE. A further outstanding point is that the runtime of CCTree(0.001,1000) is about 35 times lower than that of CLOPE, which is regarded as a fast categorical clustering algorithm.

Finally, Table 4.3 also reports the time elapsed for the clustering performed by the three algorithms. It can be observed that COBWEB pays for its accuracy with an elapsed time of 17 seconds on the dataset of 10k emails, against the 3 seconds of the much less accurate CLOPE. The CCTree algorithm outperforms both COBWEB and CLOPE, requiring only 0.6 seconds in the most accurate configuration (CCTree(0.001,1)).

From the internal evaluation we can thus conclude that the CCTree algorithm obtains clusters whose quality is comparable with those of COBWEB, while requiring even less computational time than the fast but inaccurate algorithm CLOPE.
Figure 4.7 – CCTree(0.001,1)
Figure 4.8 – CCTree(0.001,10)
Figure 4.9 – CCTree(0.001,100)
Figure 4.10 – CCTree(0.001,1000)
Figure 4.11 – CLOPE
4.3.2 CCTree Parameters Selection
Through the internal evaluation and the results reported in Table 4.3 and Figures 4.6 to 4.11, we showed the dependence of the internal evaluation indexes and of the number of clusters on the values of the µ and ε parameters. We briefly discuss here some guidelines to correctly choose the CCTree parameters in order to maximize the clustering effectiveness.

Concerning the ε parameter, we showed in Section 4.3 that it is possible to find its optimal value by setting µ = 1 and varying ε to find the fixed point in terms of accuracy, i.e. the optimal ε is the one for which smaller values of ε no longer improve the accuracy.

Once the parameter ε is fixed, the parameter µ must be selected to balance the accuracy with the number of generated clusters. As the number of clusters is affected by the µ parameter, it is possible to choose the optimal value of µ knowing the optimal number of clusters. The problem of estimating the optimal number of clusters for hierarchical clustering algorithms has been solved in [120], by determining the point of maximum curvature (knee) on a graph showing the inter-cluster distance as a function of the number of clusters.

Recalling that the silhouette index is inversely related to the inter-cluster distance, it is sound to find the knee on the graph of Figure 4.12, computed with the silhouette (Hamming) on the dataset used for internal evaluation, for seven different values of µ. For the sake of representation, we do not show in this graph the plots for µ greater than 100.
Figure 4.12 – Silhouette in function of the number of clusters for different values of µ.
Figure 4.13 – Silhouette (Hamming).
Applying the L-method described in [120], it is possible to find that the knee is located at µ = 10. It is worth recalling from Table 4.3 that the knee value of µ gives a silhouette value higher than 90%, while keeping the number of generated clusters much lower than the one obtained when µ = 1.
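The L-method can be read as follows: for every possible split point of the evaluation curve, fit a straight line to each side and keep the split minimizing the total size-weighted fitting error; the knee is the abscissa of the best split. The sketch below is an illustrative reading of [120], not the code used for the experiments.

```python
# Sketch of the L-method: locate the knee of a curve y(x) by two-line fitting.
import numpy as np

def l_method_knee(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    best_x, best_err = None, np.inf
    for c in range(2, n - 2):            # each side needs at least two points
        err = 0.0
        for lo, hi in ((0, c), (c, n)):  # left and right segments
            coeffs = np.polyfit(x[lo:hi], y[lo:hi], 1)
            resid = y[lo:hi] - np.polyval(coeffs, x[lo:hi])
            err += (hi - lo) / n * np.sqrt(np.mean(resid ** 2))
        if err < best_err:
            best_x, best_err = x[c], err
    return best_x
```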
Table 4.4 – Silhouette values and number of clusters in function of µ for four email datasets.

            µ = 1                  µ = 10                 µ = 100                µ = 1000
Data        Silhouette  Clusters   Silhouette  Clusters   Silhouette  Clusters   Silhouette  Clusters
February    0.9772      610        0.9264      385        0.7934      154        0.5931      59
March I     0.9629      635        0.9095      389        0.7752      149        0.6454      73
March II    0.9385      504        0.8941      306        0.8127      145        0.6608      74
March III   0.9397      493        0.8926      296        0.8102      131        0.6156      44
Further insight can be gained from the results of Table 4.4 and Figures 4.13 and 4.14, which report the analysis of three additional datasets of spam emails, coming from three different weeks of March 2015 and collected by the spam honeypot 5. These sets have a size comparable with that of the dataset used for internal evaluation (first week of February 2015), with respectively 10k, 10k and 9k spam emails.

Figure 4.14 – Generated Clusters.

From both the table and the graphs it is possible to see that the same trend for both the silhouette value and the number of clusters holds for all the tested datasets. Hence, we verify (i) the validity of the knee methodology, and (ii) the possibility of using the same CCTree parameters for datasets with the same data type and comparable size.
From both figures it is visible that the four datasets follow the same trends for the internal evaluation indexes and the number of clusters, for the same values of the µ parameter.

To give statistical validity to the analysis of parameter determinacy, we have analyzed a dataset of more than 200k emails collected from October 2014 to May 2015 from the honeypot 6. The emails have been divided in 21 datasets, containing 10k spam emails each. Each set represents one week of spam emails.

Tables 4.5 and 4.6 report the silhouette index and the number of clusters for six months, from October 2014 to May 2015, except February and March, which were already reported in Tables 4.4 and 4.3.

To show that the silhouette values and the numbers of clusters of the spam campaigns of Tables 4.5, 4.6, 4.4 and 4.3 keep the same trends for datasets of comparable size, we first briefly recall what the standard deviation means.

5. http://untroubled.org/spam
6. http://untroubled.org/spam
Table 4.5 – Silhouette results (Hamming distance), ε = 0.001, for varying µ.

Month   µ = 1    µ = 10   µ = 100   µ = 1000
Oct1    0.9264   0.88     0.7936    0.5738
Oct2    0.9223   0.8625   0.7299    0.5557
Oct3    0.9071   0.8555   0.7474    0.6623
Nov1    0.9228   0.8706   0.7616    0.5593
Nov2    0.9655   0.9083   0.7873    0.5054
Nov3    0.9702   0.9064   0.7951    0.5078
Dec1    0.9566   0.9012   0.7736    0.6264
Dec2    0.9626   0.9108   0.7784    0.651
Dec3    0.9787   0.942    0.8451    0.6739
Jan1    0.9697   0.9387   0.8876    0.6675
Jan2    0.9683   0.9369   0.8776    0.7407
Jan3    0.9739   0.9441   0.8923    0.662
Apr1    0.9706   0.9161   0.7894    0.6894
Apr2    0.9694   0.9174   0.8234    0.6378
Apr3    0.9738   0.9361   0.8344    0.6866
May1    0.9675   0.9184   0.7679    0.5553
May2    0.964    0.9128   0.7712    0.4434
May3    0.9668   0.9299   0.819     0.5068
Table 4.6 – Number of clusters, ε = 0.001, for varying µ.

Month   µ = 1   µ = 10   µ = 100   µ = 1000
Oct1    507     310      141       31
Oct2    652     376      143       61
Oct3    562     333      124       64
Nov1    564     312      128       50
Nov2    689     399      161       56
Nov3    685     391      172       61
Dec1    586     359      135       66
Dec2    583     343      127       64
Dec3    437     273      118       50
Jan1    366     237      132       44
Jan2    366     216      110       43
Jan3    344     208      118       37
Apr1    574     341      127       54
Apr2    528     304      131       53
Apr3    408     242      101       47
May1    622     419      159       73
May2    578     372      133       60
May3    474     313      131       48
Standard deviation: In statistics, the standard deviation (usually denoted σ) [11] is a measure quantifying the amount of variation of a set of values. A standard deviation close to 0 indicates that the data points tend to be very close to the mean, whilst a high standard deviation indicates that the data points are spread out over a wide range of values. Formally, let X be a random variable with mean value E(X); the standard deviation of X equals:

\sigma = \sqrt{E(X^2) - (E(X))^2}

that is, the standard deviation σ is the square root of the variance of X.
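As a trivial aid to reading the next figures, the statistics reported below can be reproduced over a plain list of per-dataset values (a sketch):

```python
# Sketch: mean and standard deviation, sigma = sqrt(E[X^2] - E[X]^2).
import math

def mean_and_std(values):
    n = len(values)
    mean = sum(values) / n
    variance = sum(v * v for v in values) / n - mean ** 2
    return mean, math.sqrt(max(variance, 0.0))  # clamp tiny negative round-off
```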
Figures 4.15 and 4.16 show the average values of the number of clusters and of the silhouette computed on the 21 datasets, varying the µ parameter over the values of the former experiments (i.e. 1, 10, 100, 1000). The standard deviation (defined above) is also reported as error bars. It is worth noting that the standard deviation for the values µ = {1, 10} on 10 datasets is slightly higher than 2%, while it reaches 4% for µ = 100 and 8% for µ = 1000, which is in line with the results of Table 4.4. Comparable results are also obtained for the number of clusters, where the highest standard deviation is, as expected, for µ = 1, amounting to 108, which again is in line with the results of Table 4.4. Thus, for all the 21 analyzed datasets, spanning eight months of spam emails, we can always locate the knee for the silhouette and the number of clusters at µ = 10.
Figure 4.15 – Silhouette (Hamming).
Figure 4.16 – Generated Clusters.
4.3.3 External Evaluation
The external evaluation is a standard technique to measure the capability of a clustering algorithm to correctly classify data. To this end, external evaluation is performed on a small dataset whose classes, i.e. the desired clusters, are known beforehand. This small dataset must be representative of the operative reality, and it is generally separated from the dataset used for clustering. In external evaluation, the result of the clustering algorithm is thus evaluated on data not used for clustering, whose classes are known in advance; this set of pre-classified items is often labeled by human experts. External measures evaluate how close the result of the clustering algorithm is to the predetermined labeled data. A common index used for external evaluation is the F-measure [98], which coalesces into a single index the performance on correctly classified elements and on misclassified ones.
Formally, let the sets {C_1, C_2, ..., C_k} be the desired clusters (classes) of the dataset D, and let {C'_1, C'_2, ..., C'_l} be the set of clusters returned by applying a clustering algorithm on D. Then, the F-measure F(i) of each class C_i is defined as follows:

F(i) = \max_{j} \frac{2\,|C_i \cap C'_j|}{|C_i| + |C'_j|}

To compute the overall F-measure on all desired clusters, as an aggregated index, the weighted average of the F-measures of all predetermined classes is computed as:

F_c = \sum_{i=1}^{k} \frac{|C_i|}{|\bigcup_{j=1}^{k} C_j|} \, F(i)

The F-measure result lies in the range [0, 1], where 1 represents the ideal situation in which each class C_i is exactly equal to one of the resulting clusters. More precisely, in the F-measure the elements of each predetermined class are compared with the elements of all resulting clusters, and the maximum similarity, equal to 1, is returned when among the resulting clusters there is one identical to the predetermined class.
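The two formulas above translate directly into code; in this sketch, assumed for illustration, both the reference classes and the produced clusters are represented as sets of element identifiers.

```python
# Sketch: overall weighted F-measure of produced clusters against known classes.
def overall_f_measure(classes, clusters):
    """classes, clusters: lists of sets of element identifiers."""
    total = len(set().union(*classes))  # size of the union of all classes
    fc = 0.0
    for c in classes:
        # F(i): best harmonic overlap of class c with any produced cluster.
        f = max(2 * len(c & c2) / (len(c) + len(c2)) for c2 in clusters)
        fc += (len(c) / total) * f
    return fc
```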
Experimental Results
For the sake of external evaluation, 276 spam emails collected from different spam folders of different mailboxes have been manually analyzed and classified. The emails have been divided into 29 groups (classes) according to the structural similarity of the raw email messages. The external evaluation set has no intersection with the one used for internal evaluation.
Table 4.7 – External evaluation results of CCTree, COBWEB and CLOPE.

            COBWEB   CCTree (ε = 0.001)                  CLOPE
                     µ = 1   µ = 5    µ = 10   µ = 50
F-Measure   0.3582   0.62    0.6331   0.6330   0.6       0.0076
Clusters    194      102     73       50       26        15
The results of the external evaluation are reported in Table 4.7. Building on the results of the internal evaluation, the node purity has been set to ε = 0.001 to obtain homogeneous clusters. The values of µ have been chosen according to the following rationale: µ = 1 represents a CCTree instantiation in which the µ parameter does not affect the result; on the other hand, µ = 50 returns a number of clusters comparable with the 29 classes manually collected. Higher values of µ do not modify the result for a dataset of this size.

The best results are returned for µ = {5, 10}. The F-measure for these two values is higher than 0.63, with a negligible difference between them, even if the number of generated clusters is higher than the expected one. The F-measure, in fact, considers as correctly classified the elements of an original class C even if they are divided into several clusters C'_1, ..., C'_n, provided these clusters do not contain data from classes other than C.
Table 4.7 also reports the comparison with the COBWEB and CLOPE algorithms. As shown, the CCTree algorithm outperforms COBWEB and CLOPE on the F-measure index, showing thus a higher capability of correctly classifying spam emails. We recall that in the internal evaluation COBWEB returned slightly better results than CCTree. This difference stems from the number of resulting clusters. COBWEB, in fact, always returns a high number of clusters (Table 4.7). This generally yields a high cluster homogeneity (several small and homogeneous clusters). However, it does not necessarily imply a good classification capability. In fact, as shown in Table 4.7, COBWEB returns almost 200 clusters on a dataset of 276 emails, which is more than six times the expected number of clusters (i.e., 29 clusters). This motivates the lower F-measure score of the COBWEB algorithm. It is worth noting that CCTree outperforms COBWEB even when the minimum number of elements per node is not considered (i.e., µ = 1). On the other hand, CLOPE also performs poorly on the F-measure for the 276-email dataset. The CLOPE algorithm, in fact, produced only 15 clusters, i.e. about half of the expected ones, with an F-measure score of 0.0076.
4.4 Discussion and Comparisons
Kleinberg [86] introduced three properties of clustering algorithms, named scale invariance (requiring that the output of a clustering be invariant to a uniform scaling of the distances), consistency (requiring that if within-cluster distances are decreased and between-cluster distances are increased, then the output of the clustering function does not change) and richness (requiring that, by modifying the distance function, any partition of the underlying data set can be obtained). In his famous theorem, Kleinberg proved that, "independent of any particular algorithm, objective function, or generative data model", no clustering function simultaneously satisfies the three proposed properties. Kleinberg's theorem is referred to in the literature [147], [102], [137], [124] to justify that no clustering algorithm stands as a perfect function for a specific problem; instead, it is required to respect, as much as possible, the specified and desired properties of the associated problem.
The presented methodology, based on a set of 21 categorical features and on a novel categorical clustering algorithm named CCTree, shows the capability of dividing spam emails into very homogeneous clusters, with a good accuracy in discerning different campaigns. The comparison with other categorical clustering algorithms showed the effectiveness and efficiency of CCTree when applied to the same set of features. We recall that CCTree is an unsupervised machine learning technique. Unsupervised learning algorithms do not provide results with the same accuracy as supervised learning techniques (i.e. trained classifiers). However, they have the advantage of not requiring any training procedure, thus they can also be applied to datasets for which no previous knowledge is available. This often represents the reality in the analysis of spam emails, where it is necessary to cope with the large amount of emails daily produced and collected by honeypots.
Combining the analysis of 21 features, the proposed methodology becomes suitable to analyze almost any kind of spam email. This is one of the main advantages with respect to other approaches, which mainly exploit one or two features to cluster spam emails into campaigns. These features are alternatively links [4], [92], keywords [22], [30], [46], [123], or images [150]. The applicability of these methodologies remains limited to those spam emails that effectively contain the exploited features. However, emails without links and/or images constitute a considerable percentage of spam emails.

In fact, from the analysis of the dataset used for internal evaluation, 4561 emails out of 10165 do not contain any link. Furthermore, only 810 emails contain images. To verify the clustering capability of these methodologies, we have implemented three programs to cluster the emails of the internal evaluation dataset on the basis of the contained URLs, the reported domains and the links of remote images. The emails without links and pictures have not been considered.
Table 4.8 – Campaigns on the February 2015 dataset from five clustering methodologies.

Cluster Methodology       Analyzed emails   Generated Campaigns
Link Based Clustering     4561              4449
Domain Based Clustering   4561              4547
Image Based Clustering    810               807
COBWEB (21 Features)      10165             1118
CCTree(0.001,10)          10165             392
Table 4.8 reports the number of clusters generated by each methodology. It is worth noting that on a large dataset these clustering methodologies are highly inaccurate, generating a number of campaigns close to the number of analyzed elements; hence, almost every cluster has a single element.

For comparison purposes we reported the results of the most accurate implementation of CCTree and of COBWEB, which we recall is able to produce extremely homogeneous clusters, reporting a silhouette value of 99%. We point out that comparing silhouette values is meaningless here, due to the different sets of features used. Comparisons with other methodologies, such as the FPTree-based approaches [27], [44], which require the extraction and analysis of a different set of features, are left as future work.
4.5 Related Work
As discussed in Section 4.4, there are several works in the literature which cluster spam emails through pairwise comparisons of URLs, IP addresses resolved from URLs, domains and image links. The main weakness of these approaches is that they cannot be applied to emails not presenting the required features. Moreover, the pairwise comparison imposes a quadratic complexity, against the linear complexity of CCTree. Another clustering approach, exploiting pairwise comparison of email subjects, is presented by Wei et al. [144]. The proposed methodology introduces a set of eleven features to cluster spam emails through two clustering algorithms. At first, an agglomerative hierarchical algorithm is used to cluster the whole dataset based on pairwise subject comparison. Afterwards, a connected-component graph algorithm is used to improve the performance.
The authors of [132] applied a methodology based on the k-means algorithm, named O-means clustering, which exploits twelve features extracted from each email. The O-means algorithm works on the hypothesis that the number of clusters is known beforehand, which is not always a realistic assumption. Furthermore, the authors use the Euclidean distance, which does not bring meaningful information for several of the features they apply. Differently from this approach, CCTree exploits a more general measure, i.e. Shannon entropy. Moreover, CCTree does not require the desired number of clusters as an input parameter.
The Frequent Pattern Tree (FP-Tree) is another technique applied to detect spam campaigns in large datasets. The authors of [27], [44] extract a set of features from each spam message; the FP-Tree is built based on the frequency of the features. The sensitive representation of both message layout and URL features causes two spam emails with small differences to be assigned to different campaigns. For this reason, the FP-Tree approach is prone to text obfuscation techniques [108], used to deceive anti-spam filters, and to emails with dynamically generated links. Our methodology, based on categorical features which do not consider text and link semantics, is more robust against these deceptions.
4.6 Conclusion
In this chapter, we proposed a methodology, based on a categorical clustering algorithm named CCTree, to cluster large amounts of spam emails into campaigns, grouping them by structural similarity. To this end, the set of features representing the message structure has been precisely chosen, and the intervals for each feature have been found through a discretization approach. The CCTree algorithm has been extensively tested on two datasets of spam emails, to measure both the capability of generating homogeneous clusters and the specificity in recognizing predefined groups of spam emails. A guideline for selecting the CCTree parameters has been provided, whilst the determinacy of the selected parameters for similar datasets of the same size has been statistically validated. To this end, several datasets of spam emails, each containing a large amount of spam messages (almost 10k spam emails per set), gathered from the same untroubled honey-pot, have been clustered with the CCTree algorithm, applying the same stopping criteria. Through tables and diagrams we showed that CCTree preserves the same trend when the datasets have (almost) the same size.

Considering the proven accuracy and efficiency, the proposed methodology may stand as a valuable tool to help authorities in rapidly analyzing large amounts of spam emails, with the purpose of finding and prosecuting spammers.

To the best of our knowledge, we are the first to show the effectiveness and efficiency of the proposed algorithm for clustering spam emails into campaigns, whilst previously proposed techniques were not evaluated in this respect.
Chapter 5

Labeling and Ranking Spam Campaigns
Fast analysis of correlated spam emails may be vital in the effort of finding and prosecuting spammers performing cybercrimes such as phishing and online frauds. In this chapter we present a self-learning framework to automatically divide and classify large amounts of spam emails into correlated labeled groups. Building on large datasets daily collected through honey-pots, the emails are first divided into homogeneous groups of similar messages (campaigns), each of which can be related to a specific spammer. Each campaign is then associated with a class which specifies the goal of the attacker, e.g. phishing, advertisement, etc. The proposed framework exploits a categorical clustering algorithm to group similar emails, and a classifier to subsequently label each email group. The main advantage of the proposed framework is that it can be used on large spam email datasets for which no prior knowledge is provided. The approach has been tested on more than 3200 real and recent spam emails, divided into more than 60 campaigns, reporting a classification accuracy of 97% on the classified data. Afterwards, a ranking approach is proposed to automatically rank spam campaigns according to investigator priorities ([128]).
5.1 Introduction
At the end of 2014, emails are still one of the most common forms of communication on the Internet. Unfortunately, emails are also the main vector for sending unsolicited bulk messages, generally for commercial purposes, commonly known as spam. The research community has investigated the problem for several years, proposing tools and methodologies to mitigate this issue. However, a definitive solution to the problem of spam emails has still to be found. Unfortunately, the problem of spam emails is not only related to unsolicited advertisement: spam emails have become a vector for different kinds of cybercrimes, including phishing, cyber-frauds and the spreading of malware.
Motivation: Filtering spam emails at the user end is not enough to fight this kind of attack, which moves the effect of unsolicited spam emails from merely illicit behavior to real crime. Finding the spammers becomes important not only to tackle the problem of spam emails at its source, but also to legally prosecute those responsible for the cybercrimes carried by spam emails beyond undesired advertisement. To identify spammers, the early analysis of huge amounts of messages, to find spam emails correlated with a specific spammer purpose, is vital. Several papers in the literature observed that forensic analysis, which plays a major role in finding and prosecuting spammers for cybercrimes, needs a proactive mechanism or tool able to perform a fast, multi-staged analysis of emails in a timely fashion [63], [44], [144], [40]. To this end, large amounts of spam emails, generally collected through honey-pots, should at first be divided into similar groups, which could be related to the same spammer (i.e., spam campaigns). Afterwards, each campaign should be assigned a label describing the purpose of the spammer. This goal-based labeling facilitates for investigators the analysis of spam campaigns, eventually directed toward a specific cybercrime. Moreover, labeling spam campaigns based on the goal of the spammer can help to rank them. However, this analysis generally appears to be a challenging task. In fact, considering the number of produced spam emails and their variance, spam email datasets are huge and very difficult to handle. In particular, human analysis is almost impossible, considering the amount of spam emails daily caught by a spam honey-pot [141], [144]. On the other hand, an automated and accurate analysis requires the usage of correctly trained computational intelligence tools, i.e. classifiers, whose training requires accurately chosen datasets, presenting to the classifier a good description of the reality in which it will be employed. Moreover, due to the high variance of spam emails, a valid training set may become obsolete in a few weeks, and a new up-to-date training set must be generated in a short period of time.
Though previous work has largely improved the state of the art in the analysis of spam emails for forensic purposes, further improvement is still needed. In particular, previous work either focuses on a specific cybercrime only, especially phishing [50], or exploits in the analysis a small set of features that is not effective in identifying some cybercrime emails. For example, the analysis of email text words [63] or link domains [44] is not effective in identifying emails used to distribute malware, which often do not contain text, or spam emails with dynamic links [16].
Contribution: In this chapter, we propose Digital Waste Sorter (DWS), a framework which exploits a self-learning, spammer-goal-based approach for spam email classification. The proposed approach aims at automatically classifying large amounts of raw unclassified spam emails, dividing them into campaigns and labeling each campaign with its spammer goal. To this end we propose five class labels to group spammer goals into five macro-groups, namely Advertisement, Portal Redirection, Advanced Fee Fraud, Malware Distribution and Phishing. Moreover, a set of 21 categorical features representative of the email structure is proposed, to perform a multi-feature analysis aimed at identifying emails related to a large range of cybercrimes. DWS is based on the cooperation of unsupervised and supervised learning algorithms. Given a set of classes describing different spammer goals and a dataset of unclassified spam emails, the proposed approach at first automatically creates a valid training set for a classifier, exploiting a categorical clustering algorithm named CCTree (Categorical Clustering Tree). In more detail, this clustering algorithm divides the dataset into structurally similar groups of emails, named spam campaigns [26]. DWS is built on the results of CCTree, which is effective in dividing spam emails into homogeneous clusters. Afterwards, the significant spam campaigns useful for the generation of the training set are selected through similarity with a small set of known emails, representative of each spam class. Hence, a classifier is trained using the selected campaigns as training set, and is then used to classify the remaining unclassified emails of the dataset.

To further meet the needs of forensic investigators, who have limited time and resources to perform email examinations [40], the DWS methodology does not require any prior knowledge of the dataset, except the desired classes (i.e. spammer goals) and a small set of emails representative of each class. It is worth noting that this email set cannot be used to train the classifier. In fact, this set contains a small number of emails not belonging to the dataset to be classified, and is thus not necessarily descriptive of the reality in which the classifier will operate.
In the following we describe the DWS framework in detail, explaining the process of division into campaigns, training set generation and campaign classification. The framework effectiveness has been evaluated against a set of 3200 recent raw spam emails extracted from a honey-pot. DWS reported a classification accuracy of 97.8% on this preliminary dataset. Furthermore, to justify the classifier selection, an analysis of the performance of different classifiers is presented. Finally, we propose five features, including the label of the campaigns discovered with DWS, to automatically rank a set of spam campaigns according to investigator priorities.

The rest of the chapter is organized as follows. Section 5.2 reports related work on email classification. Section 5.3 presents the DWS framework and work-flow in detail; it also gives brief background information on the clustering algorithm. Section 5.4 presents the results of the analysis on a real dataset of spam emails, as well as a comparison of the classification results of four different classifiers. In Section 5.7 a technique is proposed to rank spam campaigns. Finally, Section 5.6 briefly concludes, reporting planned future extensions.
5.2 Related Work
In the literature, spam campaigns are usually labeled based on characteristic strings (keywords) representing individual campaign types, as in [44], [88] and [55]. These approaches are weak against the kinds of spam emails which do not contain keywords or which use word obfuscation techniques. [106] labels spam campaigns on the basis of URLs, phone numbers, Skype IDs and Mail IDs used as contact information. This methodology is effective only against emails reporting contacts, which are only a subset of all the spam emails found in the wild. This means that the proposed approach fails in detecting spam campaigns not containing the aforementioned contact information.
There are several approaches in the literature in which the spammer goal is considered. However, these approaches mainly focus on detecting phishing emails, without considering other spammer purposes. Fette et al. [50] applied 10 email features to discern phishing emails from ham (good) emails. Bergholz et al. [17] propose a similar methodology with additional features to train a classifier for filtering phishing emails. Almomani et al. [3] provide a survey on different techniques for filtering phishing emails, while Gansterer et al. [53] compare different machine learning algorithms in phishing detection; furthermore, the authors propose a technique which refines previous phishing filtering approaches, automatically distinguishing three types of messages, named ham, spam and phishing. Nevertheless, the category of emails containing spam is not precisely characterized. In [34] a methodology to detect phishing emails based on both machine learning and heuristics is proposed. These approaches report accuracies ranging from 92% to 96%, where the classifiers have been trained on labeled datasets. On the contrary, DWS generates the training set on the fly, without requiring a pre-trained classifier. Notwithstanding, in the performed experiments DWS shows comparable accuracy.
5.3 Digital Waste Sorting
DWS is a Java-based framework which takes as input datasets of unclassified spam emails. DWS divides the emails into campaigns by means of a hierarchical clustering algorithm, then labels each campaign through a classifier. The classifier is trained on the fly, on a training set generated by DWS directly from the unlabeled input dataset, exploiting the knowledge generated by the clustering algorithm.

This section describes in detail the DWS framework and methodology. First we present the classes used to label each spam campaign. Then, we present the feature extraction process from raw emails, discussing the relevance of the features in describing the structural elements of an email and their relation to each spam class. The framework is then presented, briefly introducing the clustering algorithm and the methodology for the generation of the training set. Finally, the classification process is presented.
5.3.1 Definition of Classes
As anticipated, spam emails can be sent with different intentions, spanning from common advertisement to being vectors of different cybercrimes. We argue that spam emails can be divided into five well-known macro-groups, which represent the main targets of spammers and can thus be used to label spam campaigns.
Figure 5.1 – Advertisement
Advertisement: The advertisement class contains those emails whose target is convincing a user to buy a specific product [84]. Advertisement emails embody the most typical idea of spam messages, advertising any kind of product which could be of interest to companies or private users. Generally these emails only constitute a hindrance to users, who have to spend time removing them from their inbox. Notwithstanding, some campaigns provide revenues of up to 1M US dollars per month to spammers [84]. The main requirements for a commercial email to be legal, according to the Federal Trade Commission [49], are that it use no deceptive subject lines, provide correct and complete header information and the real physical location of the business, offer an opt-out choice, and honor opt-out requests within 10 business days. In the present work, we consider as advertisement emails both the ones which comply with the legal requirements and the ones that do not, given that their purpose is clearly advertising a product.

The first time that spam came under consideration as a business was in April 1994. Two lawyers from Phoenix, named Canter and Siegel, hired a programmer to distribute their “Green Card Lottery Final One!” message to as many newsgroups as possible. The interesting point in this act was that they did not hide the fact that they were the spammers; on the contrary, they were proud of it. Canter and Siegel went on to write a book titled “How to Make a Fortune on the Information Superhighway: Everyone's Guerrilla Guide to Marketing on the Internet and Other On-Line Services”. Moreover, they planned to open a consulting company to teach others, and help them, to post their own advertisements; the company never took off. Still, in 2015, spam emails remain one of the most popular tools for advertising goods.

Figure 5.1 shows a typical sample of an advertisement spam email, containing several photos and prices; clicking on each photo directs the user to the spammer website, which tries to convince him to buy a product or service.
Portal Redirection: Portal redirection spam emails are the enablers of an evolved advertisement methodology. These spam emails are characterized by a minimal structure, generally reporting one or more links to one or more websites. Once the user clicks on a link, she is redirected several times to different pages whose addresses are dynamically generated. The final target page is mostly an advertisement portal with several links divided by categories, generally related to common user needs (e.g., medical insurance). This strategy is useful in reducing the legal responsibility for spam emails of the companies whose products are advertised. The rationale is that the advertised company cannot be sued, because it is another website, i.e. the portal, that links to it; as an example, the opt-out clause of advertisement emails [83] does not apply. Moreover, the multi-redirection strategy with dynamic links makes it difficult to track the responsible websites. It is worth noting that the strategy of portal redirection emails is also used to redirect users to websites with the intention of defrauding them, to distribute malicious code, or to increase the visits of a web page.

The first redirect services 1, in 1999, took advantage of top-level domains (TLDs) like ".to" (Tonga), ".at" (Austria) and ".is" (Iceland), with the aim of making URLs memorable. The first redirect service goes back to V3.com, which redirected 4 million users at its peak in 2000. The success of V3.com resulted from the fact that it offered a wide variety of short memorable domains, including “r.im”, “go.to”, “i.am”, “come.to” and “start.at”. As the sales price of top-level domains started falling from $70.00 per year to less than $10.00, the use of redirection services declined.

Spamdexing is an attack with the purpose of pushing up the search-engine ranking of a web page. The goal of a web designer is to create a web page that will obtain favorable rankings in the search engines, and designers build their own web pages according to the standards they think will succeed. Spam emails are a good place for embedding the links for which a higher visit score is desired; to this end, the portal redirection technique can be applied to redirect users to several desired web pages.

Figure 5.2 demonstrates a typical form of a portal redirection spam email, containing several hyperlinks hidden under luring text, deceiving the user into clicking.

1. http://news.bbc.co.uk/2/hi/technology/6991719.stm
Figure 5.2 – Portal
Figure 5.3 – Fraud

Advanced Fee Fraud: An advanced fee fraud or confidence trick spam email (synonyms include confidence scheme or scam) attempts to defraud a person after first gaining her confidence, in the classical sense of trust [71]. Confidence trick spam exploits social engineering to trick the user into paying, of her own will, a certain amount of money to the attacker. Scammers may use several techniques to deceive the user into paying money, generally exploiting sentimental relations or promising a large amount of money in return. Confidence trick emails are mostly written as a long, friendly text, to draw the victim into interaction. These kinds of emails usually do not redirect the users to other web pages, and mainly contain an email address.

419 scams [61] are one of the most common types of confidence tricks, dating back to the late 18th century. This scam typically promises the victim a significant share of a large sum of money, which the spammer requires a small up-front payment to obtain. If a victim provides the payment, the fraudster either asks the victim for further fees or simply disappears. In these cases, the email subject line often contains something like “From the desk of barrister [Name]”, “Your assistance is needed”, and so on. When the victim's confidence has been gained, the scammer then introduces a problem that prevents the deal from occurring as planned, such as “To transmit the money, we need to bribe a bank official. Could you help us with a loan?” or similar. Although it is difficult to evaluate the success rate of fraud spam campaigns, one individual estimated that he sent 500 emails per day and received about seven replies, mentioning that when he received a reply, he was 70 percent certain that he would get the money 2.

The lottery scam is another well-known kind of confidence trick spam email. It contains a fake notice of a lottery win. The winner is usually asked to send sensitive information, such as name, residential address, occupation/position, lottery number, etc., to a free email address. Then the spammer informs the victim that releasing the money requires some small fee (insurance, registration, or shipping). When the victim sends the fee, the scammer asks for yet another fee 3. In the UK, lottery scams became such a big problem that many legitimate lottery sites dedicate pages to the subject to address the issue 4.

Figure 5.3 represents a typical confidence trick spam email, in which a long, friendly text tries to earn the victim's confidence.

2. www.articles.latimes.com
3. www.snopes.com
4. www.lottery.co.uk/scams
Malware : Emails are an important vector for spreading malicious software or malware.
Generally the malware is sent as email attachment, while the email structure is very
simple, with a small text which encourages the reader to open the attachment or no
text at all 5 . Once opened the malware infects the user device, showing different possible malicious behaviors. Commonly, the malware transform the victim’s device in a
bot which is used to send spam messages to other prospective victims, which can be
chosen by the spammer, or even being part of the victim’s contact list. To this category
belongs Command and Control malware and Worms [104]. Often the malicious file is
camouflaged, inserted in a zip file or with a modified extension, which allows to deceive
basic anti-virus control implemented by some spam filters.
Figure 5.4 shows a typical representation of malware spam email, where mostly contains
an attachment, convincing the user with luring sentences to open it. Notice that it is
also possible that malware campaign be designed in the format of portal redirection. By
the way, here, when we talk about malware campaign, we mean that from the layout
of spam messages, we are almost sure that the spam campaign has been designed for
malware distribution.
One of the very well-known malware spam campaign, is titled “Melissa.A”, a virus with
a woman’s name, appearing on 26 March 1999 in the United States. Melissa.A came
with the message “Here is the document you asked me for . . . do not show it to anyone”.
The virus came through email including an MS Word attachment. once opened, it was
2.
3.
4.
5.
www.articles.latimes.com
www.snopes.com
www.lottery.co.uk/scams
http ://www.symantec.com
67
Figure 5.4 – Malware
In a few days, it became one of the most important cases of massive infection in history, causing more than 80 million dollars of damage to American companies; companies like Microsoft, Intel and Lucent Technologies blocked their Internet connections to protect themselves from Melissa.A. The virus infected up to 20% of computers worldwide.
The ILOVEYOU virus is another malware spam campaign attack, which many consider to be the most damaging virus ever written. It distributed itself by email in 2000 through an attachment in the message. When opened, it loaded itself into memory, infecting
executable files. Once a user received and opened the email containing the attachment
“LOVE-LETTER-FOR-YOU.txt.vbs”, the computer became automatically infected. It
then infected executable files, image files, audio files, etc. Afterwards, it sent itself to
others by looking up the addresses contained in the MS Outlook contact list. It caused
billions of dollars in damages.
CryptoLocker 6 is a more recent malware campaign, a ransomware trojan which targeted computers running Microsoft Windows, first distributed on the Internet on 5 September 2013. When activated, the malware encrypts several types of files stored on local and mounted network drives using RSA public-key cryptography, with the private key stored only on the malware’s control server. Afterwards, the malware shows a message which offers to decrypt the data if a payment (through either Bitcoin or a pre-paid cash voucher) is made by a stated deadline, and the victim is threatened with the deletion of the private key if the deadline passes. Figure 5.5 shows the increase of crypto-ransomware from 2013 to 2014 7 .
6. www.arstechnica.com
7. www.symantec.com
Figure 5.5 – Crypto ransomware volume
Phishing : Phishing emails attempt to redirect users to websites which are designed to illegally obtain credentials or financial data such as usernames, passwords, and credit card details [3]. Generally, these emails pretend to be sent by a banking organization, or to come from a service accessible through username and password, e.g. social networks, instant messaging, etc., reporting fake security issues that require the user to confirm her data to access the service again. To this end, phishing emails are mostly very well presented, with a well-organized structure, even reporting contact information such as phone numbers and email addresses. The representative structure of the phishing emails we applied in this research contains a short, well-written text providing the victim with some important news. Mostly there is one link, which directs the user to a very well-designed fake website of a bank, which directly asks the victim to provide her credit card information.
On 26 January 2004, the U.S. Federal Trade Commission filed the first lawsuit against a suspected phisher. The case concerned a Californian teenager who created a webpage designed to look like the America Online website and used it to steal credit card information 8 . Other countries have followed this lead by tracing and arresting phishers. A phishing kingpin, Valdir Paulo de Almeida, was arrested in Brazil for leading one of the largest phishing crime rings, which in two years stole between $18 million and $37 million 9 .
Phishing emails were still, in 2015, one of the most dangerous and effective kinds of spam, requiring extensive efforts to fight against.
Figure 5.6 demonstrates a typical sample of phishing spam email, mostly well designed to seem, as much as possible, like a real message from the organization it pretends to come from.
8. edition.cnn.com/2003/techinternet/07/21/phishing.scam
9. www.channelregister.co.uk
Figure 5.6 – Phishing
5.3.2 Feature Extraction
DWS parses raw spam emails (eml files), extracting a set of 21 categorical features and building a numerical vector readable by clustering and classification algorithms. The extracted features are reported in Table 5.1. It is worth noticing that Tables 5.1 and 4.1 are identical; we report the table again here to relate the spammer goals to the set of features, as follows.
The “number of recipients” in the To and Cc fields of the email differentiates between emails which should look strictly personal, e.g. communications from a bank (phishing), and those that pretend to be sent to several recipients, such as some kinds of fraud or advertisement.
The structure of the links in the email text gives several pieces of information useful in determining the email goal. Portal redirection and advertisement emails generally show a high “number of links”, in the first case to redirect the user to different portal websites, in the second to redirect the user to the website where she can buy the products. Generally, fraud emails do not report links, except for “IP based links”. These links are expressed through IP addresses, without reporting domain names, to reduce the likelihood of being tracked or to make the email text, generally discussing secret money transactions, look more legitimate. The “number of domains in links” represents the number of different domains globally found in all the links in the email text. Advertisement and phishing emails generally have just a single domain: respectively, the domain of the website where the advertised product can be bought, and that of the authority the message pretends to be sent from. On the other hand, portal redirection emails may contain several domains to redirect the reader to different portal websites. Moreover, links in portal redirection emails generally have a high “average number of dots in links” (i.e. sub-domains) and, being dynamically generated, are likely to contain hexadecimal or non-ASCII characters.
Attribute | Description
RecipientNumber | Number of recipient addresses.
NumberOfLinks | Total links in email text.
NumberOfIPBasedLinks | Links shown as an IP address.
NumberOfMismatchingLinks | Links with a text different from the real link.
NumberOfDomainsInLinks | Number of domains in links.
AvgDotsPerLink | Average number of dots in links in text.
NumberOfLinksWithAt | Number of links containing “@”.
NumberOfLinksWithHex | Number of links containing hex chars.
NumberOfNonAsciiLinks | Number of links with non-ASCII chars.
IsHtml | True if the mail contains html tags.
EmailSize | The email size, including attachments.
Language | Email language.
AttachmentNumber | Number of attachments.
AttachmentSize | Total size of email attachments.
AttachmentType | File type of the biggest attachment.
WordsInSubject | Number of words in subject.
CharsInSubject | Number of chars in subject.
ReOrFwdInSubject | True if subject contains “Re” or “Fwd”.
SubjectLanguage | Language of the subject.
NonAsciiCharsInSubject | Number of non-ASCII chars in subject.
ImagesNumber | Number of images in the email text.
Table 5.1 – Features extracted from each email.
Non-ASCII characters in the links are also typical of some advertisement emails redirecting to foreign websites. It is worth noting that all these link-based features consider the real destination address, not the clickable text shown to the user. If the clickable text (hyper-link) shows an address (“click here”-like text is not considered) different from the destination address, the link is considered mismatching and counted through the feature “mismatching links”. Phishing and portal redirection emails make extensive use of mismatching links to deceive the user.
Advertisement and phishing emails may appear like a web page. In this case, the email contains HTML tags. On the other hand, fraud, malware and portal emails are rarely presented in HTML format. The size of an email is another important structural feature. Confidential trick and portal redirection emails are generally quite small in size, considering they are raw text. Advertisement, malware and some kinds of phishing emails generally have a more complex structure, including images and/or attachments, which makes the message size grow noticeably. “Attachment Number”, “Attachment Size” and “Attachment Type” are structural features mainly used to distinguish between the attachments of malware emails and those of advertisement and phishing emails, which attach images to the email for correct visualization. The “Number of Images” in an email determines the global look of the message. Images are typical of some advertisement and phishing emails. Finally, three features are used for the analysis of the subject. For example, some advertisement emails use several one-character words or non-ASCII characters in emails to deceive typical spam detection techniques based on keywords [123].
Table 5.2 – Feature vectors of a spam email for each class.
Class | NumAttach | TypeAttach | NumLinks | NumImages | NumDomains | EmailSize | SubjLang | CharsInSubj | Lang
Advert. | 0 | 0 | 11 | 12 | 2 | 14 | 10 | 3 | 10
Portal | 0 | 0 | 10 | 0 | 1 | 10 | 1 | 3 | 1
Fraud | 0 | 0 | 0 | 0 | 0 | 10 | 1 | 1 | 1
Malware | 1 | 5 | 0 | 0 | 0 | 21 | 1 | 2 | 1
Phishing | 0 | 0 | 2 | 0 | 2 | 9 | 1 | 3 | 1
It is worth noting that non-ASCII characters are rarely used in phishing emails, in order to make them look more legitimate. Moreover, some fraud and phishing emails use a deceiving mail subject with the “Re :” or “Fwd :” keyword to look like part of a conversation triggered by the victim. Furthermore, some fraud emails are characterized by a difference between the email “Language” and the “Subject Language”. Many scam emails are, in fact, translated through automatic software which ignores the subject, causing this language duality.
For further insight, Table 5.2 shows the vectors of some selected features extracted from the five emails of Figures 5.1, 5.2, 5.3, 5.6 and 5.4.
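As a rough illustration of this extraction step, the following Python sketch computes a handful of the 21 features of Table 5.1 from a raw .eml file. It is a simplified approximation using only the standard library; the exact parsing rules of DWS (language detection, mismatching-link checks, decoding of encoded parts, etc.) are not reproduced here.

# Simplified sketch: a few of the 21 features of Table 5.1 from an .eml file.
import email
import re

def extract_features(path):
    with open(path, "rb") as fh:
        msg = email.message_from_binary_file(fh)
    # Concatenate the plain-text parts of the body (no decoding, sketch only).
    body = "".join(str(p.get_payload()) for p in msg.walk()
                   if p.get_content_type() == "text/plain")
    links = re.findall(r"https?://\S+", body)
    subject = msg.get("Subject", "") or ""
    recipients = [a for a in re.split(r"[,;]",
                  (msg.get("To") or "") + "," + (msg.get("Cc") or ""))
                  if a.strip()]
    return {
        "RecipientNumber": len(recipients),
        "NumberOfLinks": len(links),
        "NumberOfIPBasedLinks": sum(bool(re.match(r"https?://\d+\.\d+\.\d+\.\d+", l))
                                    for l in links),
        "AvgDotsPerLink": sum(l.count(".") for l in links) / len(links) if links else 0,
        "IsHtml": any(p.get_content_type() == "text/html" for p in msg.walk()),
        "WordsInSubject": len(subject.split()),
        "CharsInSubject": len(subject),
        "ReOrFwdInSubject": subject.lower().startswith(("re", "fwd")),
    }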
5.3.3 DWS Classification Workflow
After the email features have been extracted, the resulting feature vectors are given as input to the DWS classification workflow. This process aims at dividing the unclassified spam emails into campaigns and labeling them through a classifier trained on the fly. The classifier can then be applied to label new spam emails. To give a better insight, the workflow of the proposed approach is depicted in Figure 5.7.
The main part of the workflow aims at generating a valid training set from the dataset of unclassified emails, applying a hierarchical clustering algorithm to divide the emails into campaigns (step 1 in Figure 5.7). The chosen algorithm, named Categorical Clustering Tree (CCTree), generates a tree-like structure (step 2) which is exploited to associate a campaign to each email coming from a small dataset of labeled emails. The campaign receives the label of the email associated to it (step 3). This set of campaigns is then used as training set for a classifier (step 4), successively used to label all the remaining campaigns (steps 5 and 6).
The framework is based on a clustering algorithm (CCTree) and a classifier. As discussed, classifiers are generally more accurate than clustering algorithms, due to the supervised learning approach. However, the major drawback of classifiers is that a valid training set, with enough elements and representative of the reality, is not always available. We argue that it is possible to create such a training set by exploiting the CCTree algorithm and a small set of classified emails (C). The C set contains emails representative of each class; however, the number of its elements is not enough to constitute a valid training set for a classifier.
Figure 5.7 – DWS Workflow.
The CCTree algorithm, starting from the dataset D, generates a decision-tree-like structure whose leaves are the final, unlabeled clusters. Following the CCTree structure, it is possible to collocate the emails of the set C in the unlabeled clusters of the set D, to find similarly structured emails. In fact, in the problem of clustering spam emails, each cluster represents a set of homogeneous, similar spam emails, i.e. a spam campaign. Thus, for the purpose of goal-based labeling, all emails belonging to the same cluster receive the same label. Finally, the emails of these homogeneous clusters can be used as training set for the supervised learning classifier. After the classifier has been trained, it is used to classify the remaining leaves of the CCTree that were not reached by any email of the set C.
Figure 5.7 schematically depicts the typical operative workflow of the proposed framework. In the following, the six steps of the DWS workflow are described in detail.
Phase 1 : Clustering Spam Emails into Campaigns
The first step performed by the DWS framework is to divide large amounts of unclassified
spam emails (constituting the set D) into smaller groups of similar messages (steps 1 and 2 in
Figure 5.7). Emails are clustered by structural similarity exploiting the CCTree algorithm.
Figure 5.8 – Insert new instance X in a CCTree
Phase 2 : Training Set Generation
In order to label the campaigns, it is necessary to train a classifier to recognize emails coming from the five predefined spam classes (steps 3 and 4 in Figure 5.7). To this end, the classifier must be provided with a good training set, which has to be representative of the reality in which the classifier has to operate. For this reason, the training set is extracted from the unclassified email dataset D itself. More specifically, the CCTree structure generated in the previous step is exploited to label a small number of the generated spam campaigns. To this end, a small number of campaigns is labeled with the use of a small set of labeled emails C. This set contains a small number of manually selected spam emails, equally distributed over the five classes, all structurally different. These spam emails do not come from the D set. The emails in the C dataset have to be accurately chosen on the basis of the emails the investigators are interested in. For example, Italian police investigators interested in following a phishing case should put in the C dataset some emails with Italian text and bank names. After extracting the values of the features from the emails in C, they are fed one by one to the CCTree generated on D. Following the CCTree structure, each email ci is eventually inserted in a campaign Cj (Figure 5.8). The campaign Cj is then labeled with the class of ci and all its emails are added to the training set.
If the same spam campaign is reached by two or more emails of different classes, the campaign is discarded and the emails are re-evaluated to be sent to other campaigns. It is worth noting that such an event is unlikely, due to the high homogeneity of the clusters generated through CCTree. Furthermore, in the event that an email in C does not reach any campaign, i.e. a specific attribute value of the email is not present in the CCTree, the email is inserted in the most similar campaign. To this end, the node purity of each campaign is calculated before and after the insertion of the email ci . The email is thus assigned to the campaign in which the difference between the two purities, weighted by the number of elements, is smallest.
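The following minimal sketch illustrates this fallback assignment. The purity measure here is an assumption (approximated by the negated mean Shannon entropy of the attributes), since this chapter does not restate the exact CCTree purity definition.

# Hypothetical purity-based fallback assignment (Phase 2 of DWS).
from collections import Counter
import math

def attribute_entropy(values):
    """Shannon entropy of one attribute's value distribution."""
    counts, n = Counter(values), len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def purity(campaign):
    """Assumed purity: negated mean entropy over attributes.
    A campaign is a non-empty list of feature dictionaries."""
    attrs = campaign[0].keys()
    return -sum(attribute_entropy([e[a] for e in campaign]) for a in attrs) / len(attrs)

def assign_to_most_similar(email, campaigns):
    """Insert `email` into the campaign whose size-weighted purity
    changes the least after the insertion."""
    def weighted_delta(c):
        return abs(purity(c + [email]) * (len(c) + 1) - purity(c) * len(c))
    return min(campaigns, key=weighted_delta)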
Phase 3 : Labeling Spam Campaigns
Feeding the training set to the classifier, we are able to classify all the remaining campaigns generated by the CCTree (steps 5 and 6 in Figure 5.7). To this end, each campaign resulting from the CCTree is given to the classifier. The classifier labels each email of the received campaign on the basis of the spammer purpose. Under two conditions, DWS considers a spam campaign as non-classified.
Firstly, it is possible that emails belonging to the same campaign receive different labels, e.g. phishing and portal redirection. In such a case, calling “majority class” the label with the most emails in the cluster, the campaign is considered non-classified if the emails of the majority class amount to less than 90% of all the emails in the campaign.
The second condition is instead related to the prediction error reported by the classifier on each element of a campaign. The predicted error is computed as 1 − P (ei ∈ Ωj ), where P (ei ∈ Ωj ) is the probability that the element ei belongs to the class Ωj , i.e. the label assigned to the element ei . The DWS framework considers a campaign as non-classified if the average predicted error is more than 30%. If the non-classified campaigns and/or elements constitute a considerable percentage, it is possible to restart the classification process, running the CCTree algorithm with tighter criteria for node purity.
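A compact sketch of the two rejection conditions follows; the function and variable names are illustrative, not from the thesis.

from collections import Counter

def campaign_is_classified(labels, predicted_errors,
                           majority_threshold=0.9, error_threshold=0.3):
    """DWS keeps a campaign's label only if (1) the majority class covers
    at least 90% of its emails and (2) the average predicted error
    1 - P(e_i in Omega_j) is at most 30%."""
    majority_share = Counter(labels).most_common(1)[0][1] / len(labels)
    avg_error = sum(predicted_errors) / len(predicted_errors)
    return majority_share >= majority_threshold and avg_error <= error_threshold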
5.4 Results
This section presents the experimental results of the DWS framework. First we discuss the
classifier selection process, exploiting two small datasets of manually labeled spam emails.
Afterward, we present the results for a real use case of the DWS framework on a recent
dataset of spam emails.
5.4.1 Classifier Selection
In this first set of experiments we compare the performance of three different classifiers. To this end, two sets of real spam emails are provided to be used as training and test sets. These two datasets are extracted from emails collected by the Untroubled honeypot 10 in January and February 2015. The emails have been manually analyzed and labeled for standard supervised learning classification and performance evaluation. The manual analysis and labeling process has been performed rigorously, analyzing text and images, and following the links in each email. Only the emails for which the discovered class was certain have been inserted into the datasets. For a spam email, the label is certain if it matches the label description given in Section 5.3.1 and the label is verified through manual analysis. For example, Portal Redirection emails are certainly labeled if the links really redirect to a portal website. The first dataset, used as training set, is made of 160 spam emails; the second one, used as test set, is made of 80 emails.
10. http://untroubled.org/spam
Experiments have been run on all the classifiers offered by the WEKA library for classifying categorical data. For the sake of brevity and clarity, we only report the classifier with the best results from each classifier group. More specifically, the chosen classifiers are K-Star from the Lazy group, Random Forest from the Tree group and Bayes Network from the NaiveBayes group. Among these three classifiers, the best one has been used by the DWS framework in its operative phase.
Dataset Dimensioning
The process of manual analysis and labeling is time consuming. However, a well-balanced dataset, without duplicates and representative of the five classes, is necessary to correctly assess classifier performance. Given the complexity of the manual analysis procedure, it is not possible to choose training and testing sets of extremely large dimension. Thus, standard dimensioning techniques have been used for both the training and the testing set. A general rule to assess the minimum size of a training set is to dimension it as six times the number of used features [140]. It is worth noting that the training set of 160 elements already matches this condition (6 × 21 < 160). However, in a multi-class problem, the dimension of the data should provide good results in terms of sensitivity and specificity, i.e. true positive rate (TPR) and (1 − false positive rate (FPR)) respectively, when K-fold validation is applied [14]. This must be done while keeping the relative frequencies of data in the various classes balanced. As shown in the following, the provided training set returns, for K-fold validation, a Receiver Operating Characteristic Area Under Curve (ROC-AUC, or AUC for short) higher than 90% for all tested classifiers.
Concerning the test set, what matters is a null intersection with the training set and balanced relative frequencies of the various classes. In [14], the minimum size for a testing set to provide meaningful results in a five-class classification problem is estimated to be 75, which is smaller than the provided test set of 80 spam emails.
Classification Results
We now report the classification results of the three tested classifiers on the two aforementioned datasets. The first set has been used as training set for the classifiers. According to the methodology in [14], a first performance evaluation has been done through the K-fold (K = 5) validation method, classifying the data K times, using each time (K − 1)/K of the dataset as training set and the remaining elements as testing set. The evaluation indexes used are the True Positive Rate (TPR), the False Positive Rate (FPR) and the Receiver Operating Characteristic Area Under Curve (ROC-AUC, or simply AUC). The AUC is defined on the interval [0, 1] and measures the performance of a classifier under the variation of a threshold parameter T , proper to the classifier itself, according to the following formula :

AUC = ∫_{−∞}^{+∞} TPR(T) · FPR′(T) dT

where FPR′ = 1 − FPR. When the value of AUC is equal to 1, the classifier is considered “perfect” for the classification problem.
Table 5.3 – Classification results evaluated with K-fold validation on training set.
Algorithm | True Positive Rate | False Positive Rate | Area Under Curve
K-star | 0.956 | 0.01 | 0.996
RandomForest | 0.937 | 0.019 | 0.992
BayesNet | 0.95 | 0.013 | 0.996
Table 5.3 reports the TPR, FPR and AUC of the three classifiers, i.e. their ability to correctly classify elements among the five classes, for the K-fold test on the first dataset (160 spam emails). As shown, all the classifiers return an accuracy higher than 90%.
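To make the evaluation procedure concrete, here is a hypothetical re-creation of the K-fold AUC computation with scikit-learn; the thesis actually used WEKA’s K-Star, Random Forest and Bayes Network classifiers, and real feature vectors rather than the placeholder data below.

# Hypothetical K-fold (K=5) evaluation with scikit-learn instead of WEKA.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

X = np.random.rand(160, 21)        # placeholder: 160 emails x 21 features
y = np.random.randint(0, 5, 160)   # placeholder: five class labels

clf = RandomForestClassifier(random_state=0)
# Each email is scored by a model trained on the other four folds.
probs = cross_val_predict(clf, X, y, cv=5, method="predict_proba")
# Macro-averaged one-vs-rest AUC over the five classes.
print("AUC:", roc_auc_score(y, probs, multi_class="ovr", average="macro"))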
Table 5.4 – Classification results evaluated on test set.
Class | K-star TPR / FPR / AUC | RandomForest TPR / FPR / AUC | BayesNet TPR / FPR / AUC
Advertisement | 1.000 / 0.031 / 0.998 | 1.000 / 0.000 / 1.000 | 1.000 / 0.031 / 0.967
Portal | 0.786 / 0.000 / 0.996 | 0.786 / 0.016 / 0.985 | 0.929 / 0.000 / 0.998
Fraud | 1.000 / 0.016 / 0.992 | 1.000 / 0.016 / 0.951 | 1.000 / 0.016 / 0.928
Malware | 0.938 / 0.016 / 0.995 | 0.938 / 0.016 / 0.908 | 0.938 / 0.016 / 0.957
Phishing | 0.947 / 0.017 / 0.977 | 0.947 / 0.051 / 0.963 | 0.842 / 0.017 / 0.907
Average | 0.9342 / 0.016 / 0.9916 | 0.9342 / 0.019 / 0.9614 | 0.9418 / 0.016 / 0.9514
Afterward, the whole first dataset has been used to train the three classifiers, whilst the second dataset has been used as test set. Table 5.4 reports the detailed classification results, where the classifiers are trained with the training set (160 spam emails) and evaluated with the test set (80 spam emails). The results are reported for the five classes with TPR, FPR and AUC.
For further insight, we report in Figures 5.9, 5.10, 5.11, 5.12, and 5.13 the comparison of the ROC curves of the three classifiers for the five classes, measured on the test set. It is worth noting that in all cases the area under the ROC curve is close to 1; hence, in general, the classifiers show good performance on the testing set for each class.
As can be observed in Table 5.3, on average the K-star and Bayes Net classifiers give slightly better K-fold results. However, the K-star classifier yields the best average results in terms of AUC when evaluated with the test set (Table 5.4). Therefore, K-star is the classifier we implement in the DWS framework.
Figure 5.9 – ROC curve / Advertisement
Figure 5.10 – ROC curve / Portal Redirection
Figure 5.11 – ROC curve / Fraud
Figure 5.12 – ROC curve / Malware
Figure 5.13 – ROC curve/ Phishing
5.4.2 DWS Application
The second set of experiments aims at assessing the capability of the framework to cluster and label large amounts of spam emails. To this end, the DWS framework has been tested on a set of 3230 recent spam emails. The spam emails have been extracted from the collection of the honeypot 11 , related to the first week of March 2015. The emails have been manually analyzed and labeled for performance analysis.
Phase 1 : Clustering with CCTree
In the first step, CCTree has been used to divide the emails into campaigns. The CCTree parameters have been chosen by finding the optimal values for the number of generated clusters and homogeneity, using the knee method described in Chapter 4. Applying CCTree, 135 clusters have been generated, of which 73 contain only one element. Clusters with a single element have not been considered: these emails are, in fact, outliers which do not belong to any spam campaign. The remaining 3149 emails, divided into 62 clusters, have been used for the following steps.
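A minimal sketch of this outlier-filtering step, assuming `clusters` is the list of leaf clusters produced by a CCTree implementation (not shown here):

def drop_singletons(clusters):
    """Keep only clusters with more than one email; singleton clusters
    are treated as outliers that belong to no spam campaign."""
    campaigns = [c for c in clusters if len(c) > 1]
    outliers = [c[0] for c in clusters if len(c) == 1]
    return campaigns, outliers

# In the experiment: 135 leaves -> 62 campaigns, 73 singleton outliers.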
Phase 2 : Training Set Generation
To generate the training set, we used a small dataset made of three representative emails for each of the five classes. These 15 emails have been manually selected from different datasets of real spam emails, including the personal spam inboxes of the authors. To facilitate the manual analysis of the classified spam emails, the 15 emails of the set C are written in English.
11. http://untroubled.org/spam
Each email has been assigned to one of the 62 spam campaigns, following the CCTree structure, as described in Section 5.3.3. The campaigns associated to these emails are used as training set.
Table 5.5 – Training set generated from small knowledge.
Class | Number of Emails | Number of Campaigns
Advert. | 29 | 2
Portal | 66 | 3
Fraud | 113 | 3
Malware | 27 | 1
Phishing | 17 | 1
Total | 252 | 10
The generated training set (Table 5.5) is composed of 252 emails, contained in 10 campaigns. It is worth noting that the 15 emails have not been added to the associated clusters after the CCTree classification, so as not to alter the decision on the following emails.
Phase 3 : Labeling Spam Campaigns
After training the classifier with the generated training set, we label the remaining (52 out of
62) unlabeled spam campaigns of CCTree. The classification results are reported in Table 5.6.
Table 5.6 – DWS classification results for the labeled spam campaigns.
Class | Campaigns Correct | Campaigns Wrong | Emails Correct | Emails Wrong | TPR | FPR | Accuracy
Advert. | 5 | 0 | 137 | 0 | 1 | 0 | 1
Portal | 26 | 0 | 1331 | 0 | 1 | 0.03 | 0.9935
Fraud | 10 | 2 | 1032 | 43 | 0.96 | 0.01 | 0.9788
Malware | 3 | 0 | 31 | 0 | 1 | 1 | 1
Phishing | 7 | 1 | 213 | 18 | 0.915 | 0 | 0.994
Total | 51 | 3 | 2744 | 61 | 0.975 | 0.008 | 0.9782
The table reports, for each class, the number of campaigns and corresponding emails classified correctly or incorrectly. Moreover, we report for the emails the statistics on TPR, FPR and Accuracy (i.e., the ratio of correctly classified elements). The global accuracy (last row of the table) is 97.82%. However, we point out that, due to the conditions on predicted error reported in Subsection 5.3.3, 8 campaigns out of 62, containing 344 spam emails, are considered unclassified. For the sake of accuracy, considering these 8 campaigns as misclassified, the total accuracy for emails on the dataset is 87.14%. This accuracy is in line with previous works on classifying emails into phishing and ham [34], [50], [17].
Concerning the 8 non-classified campaigns, 3 campaigns containing 68 spam emails were correctly labeled as portal; however, they are considered unclassified since the average predicted error is higher than 30% in all 3 of them. 4 campaigns containing 258 spam emails are phishing campaigns: 2 of them, with 116 messages, were correctly identified as phishing but did not match the predicted error condition, while the other 2 were incorrectly classified as fraud and are in any case considered unclassified due to the high predicted error. The last campaign, with 18 elements, belongs to the advertisement class but was incorrectly classified as fraud; again, the predicted error condition is not matched. It is worth noting how the condition on predicted error is useful in increasing the overall accuracy on classified data.
From Table 5.6 it is possible to infer that a large portion of spam messages belongs to the portal and fraud classes. Even if these preliminary results are related to a relatively small dataset, they are indicative of the current trend in the distribution of spam emails, which may provide the spammer with the greatest result at the smallest risk.
5.5 Ranking Spam Campaigns
Since the number of spam emails collected daily is astronomical, even after clustering spam emails into smaller similar groups (spam campaigns), a methodology is still required to automatically order spam campaigns according to investigator priorities. To this end, in this section, we provide several features (including the label of the campaigns) and a weight for each feature in order to attribute a grade to each spam campaign. The set of campaigns is then ordered based on their grades. More features can be added to the provided set depending on the case study.
Ranking spam campaigns helps the investigator decide which set of spam messages, assigned to a specific spammer, needs to be analyzed and prosecuted first. Furthermore, if the investigator pursues a specific goal, for example dangerous spam campaigns directed at Canada, our proposed ranking methodology can be applied.
5.5.1 Ranking Features
In this section, we propose a set of five ranking features to order spam campaigns. The ranking features are presented in Table 5.7. Afterwards, we explain in detail what each ranking feature means and how it is normalized to the interval [0, 1].
— Number of Data belonging to a Spam Campaign (N)
— Domain of URLs (U)
— Language of spam message (L)
— Burst Property, Analysis of Distribution of Data in a period of Time (B)
— Class (Label) of Campaigns (C)
Table 5.7 – Set of ranking features
Number of Data (N) : The number of data in a campaign refers to the number of spam emails belonging to that campaign. It is normalized based on the number of elements in the largest campaign. More precisely, suppose the campaign containing the maximum number of elements contains nmax spam emails. The number of data of the i-th campaign, containing ni elements, is normalized as Ni = ni /nmax . Hence, Ni ∈ [0, 1].
URL Domain of Campaign (U) : The URL domain of a spam message is a boolean feature, which equals 1 for a spam email if one of the desired domains occurs among the URLs in the body of the message, and equals 0 otherwise. The URL domain of a spam campaign equals the fraction of spam messages in the campaign for which the URL domain equals 1. For example, consider an investigator interested in emails oriented to Canada. In this case, the appearance of URLs with domains like “.ca” in the body of the messages of a campaign makes it more interesting than other campaigns. To this end, a set of interesting domains X = {X1 , . . . , Xk } is provided; then, for each message in the spam campaign respecting one of the provided interesting domains, the URL domain of the message equals 1. The URL domain of the spam campaign equals the number of messages for which the result is 1, divided by the whole number of messages in the campaign. By definition, the feature U is normalized to [0, 1].
Language (L) : The language of the message is another criterion which helps an investigator interested in spam campaigns oriented to a specific region, e.g. Canada. To this end, a set of desired languages is provided; e.g. for Canada the set of languages may contain English and French. Then, the language of a message equals 1 if it has been written in one of the desired languages, and 0 otherwise. The language of a campaign (L) equals the fraction of messages for which the language of the message equals 1. By definition, the criterion L is normalized to [0, 1].
Burst Property (B) : A spam campaign for which the number of spam messages decreases as time passes is less dangerous than one for which the number of produced spam emails is increasing. We call this criterion of an increasing number of elements the burst property of the campaign, and we calculate it by dividing the period of time between the first email and the last email (in terms of time) of a spam campaign into two halves. If the number of emails in the second half is greater than the number of spam messages in the first half, we say that the spam campaign respects the burst property and we attribute 1 to B; otherwise, the campaign does not respect the burst property and we attribute 0 to it. (A short sketch after the last item of this list illustrates how U, L and B can be computed.)
Class (label) of Campaign (C) : The label of a campaign (C) is returned by the DWS framework. Here, we propose an approach to attribute a score to each label. It is worth noticing that the proposed score for each label can be modified according to the investigator’s priorities.
Phishing spam messages are the most dangerous kind of spam emails, stealing important information from the victim in a very well-presented format. After phishing, malware spam emails are the most dangerous ones, in the sense that the computer of the end user is mostly affected without his awareness. Fraud emails, while dangerous, are less dangerous than phishing and malware: fraud spam messages mostly reach their goal through several rounds of communication, and during this interaction it is possible that the victim becomes aware of the risk of continuing the communication, or that a filtering service stops it before the required money is transferred. Portal emails, mostly not well presented, are generally recognized by the user as spam and are hence not as dangerous as the previous groups. Finally, advertisement spam emails, which mostly propose a real product, are the least dangerous spam campaigns. Considering that campaigns with an unknown label are not considered really dangerous, we score the phishing, malware, fraud, portal, advertisement and unknown campaigns as 6, 5, 4, 3, 2, 1, respectively. The score of a campaign label is normalized by dividing each score by 6 (Table 5.8).
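As announced above, here is a minimal Python sketch of how the U, L and B features could be computed; the campaign representation (a list of per-message records with `urls`, `lang` and `time` fields) and the domain-matching rule are illustrative assumptions, not the thesis implementation.

def url_feature(campaign, interesting=(".ca", ".com")):
    """U: fraction of messages containing a URL in an interesting domain."""
    hits = sum(any(u.endswith(d) for u in m["urls"] for d in interesting)
               for m in campaign)
    return hits / len(campaign)

def language_feature(campaign, desired=("en", "fr")):
    """L: fraction of messages written in one of the desired languages."""
    return sum(m["lang"] in desired for m in campaign) / len(campaign)

def burst_feature(campaign):
    """B: 1 if the second half of the campaign's time span contains more
    messages than the first half, 0 otherwise."""
    times = sorted(m["time"] for m in campaign)
    midpoint = (times[0] + times[-1]) / 2
    in_second_half = sum(t > midpoint for t in times)
    return 1 if in_second_half > len(times) - in_second_half else 0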
Table 5.8 – Normalized score of spam campaign labels
label | Phishing | Malware | Fraud | Portal | Advert. | Unknown
normalized score | 1 | 0.83 | 0.66 | 0.5 | 0.33 | 0.16
5.5.2 Spam Campaign Grade
To attribute a grade to each spam campaign after extracting its ranking features, it is required to provide a weight for each ranking feature. The weight of a feature is set by an expert and may vary from one case to another. The weights of the features should be normalized so that they sum to 1, which can simply be achieved by dividing each weight by the sum of the weights. The weighted features show the importance of each feature in ranking spam campaigns.
We define the grade of a campaign C, written grade(C), as follows :
grade(C) = ω1 · C + ω2 · N + ω3 · U + ω4 · L + ω5 · B
where C, N , U, L and B are the extracted ranking features of campaign C and ωi ∈ [0, 1] for 1 ≤ i ≤ 5. By definition, grade(C) ∈ [0, 1].
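A minimal sketch of the grading step (feature extraction is assumed done; the dictionary keys are illustrative):

def grade(campaign, weights):
    """Weighted sum of the five normalized ranking features.
    campaign and weights are dicts over the keys C, N, U, L, B;
    the weights are normalized so that they sum to 1."""
    total = sum(weights.values())
    return sum(weights[k] / total * campaign[k] for k in ("C", "N", "U", "L", "B"))

# Equal weights, as in the experiment of Section 5.5.3:
w = {k: 0.2 for k in ("C", "N", "U", "L", "B")}
c1 = {"N": 1, "U": 0.96, "L": 0.98, "B": 1, "C": 0.5}
print(grade(c1, w))  # ≈ 0.888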
5.5.3 Ranking Application
In this section, we propose an approach to order a set of spam campaigns according to their grades. To this end, we first present a simple ranking methodology, named dense ranking, in which objects having the same score receive the same rank. Afterwards, we describe the experiment of ranking the spam campaigns resulting from Section 5.4.2.
Table 5.9 – Three first ranked campaigns
 | Number of Data | URL Domain | Language | Burst | Label | grade
Campaign 1 | 1 | 0.96 | 0.98 | 1 | 0.5 | 0.888
Campaign 2 | 0.78 | 0.86 | 0.97 | 1 | 0.66 | 0.854
Campaign 3 | 0.15 | 0.91 | 1 | 1 | 1 | 0.812
Definition 5.1 (Dense Ranking (1223 ranking)). In dense ranking, objects having the same score receive the same ranking number, and the next object(s) receive the immediately following ranking number. Hence, each object’s ranking number is 1 plus the number of objects ranked above it that are distinct with respect to the ranking order. For example, if A ranks ahead of B and C, where B and C rank equal, and both rank ahead of D, then A gets ranking number 1, B and C each get ranking number 2, and finally D gets ranking number 3, i.e. A = 1, B = 2, C = 2, D = 3.
To apply dense ranking to order a set of spam campaigns, it is enough to first compute the grade of each spam campaign. Afterwards, each campaign is ranked according to its own grade: the greater the grade, the lower (better) the rank.
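A small sketch of dense ranking applied to campaign grades (higher grade, better rank):

def dense_rank(grades):
    """Dense ('1223') ranking: equal grades share a rank number and the
    next distinct grade gets the immediately following rank number."""
    distinct = sorted(set(grades), reverse=True)  # greater grade -> lower rank
    rank_of = {g: i + 1 for i, g in enumerate(distinct)}
    return [rank_of[g] for g in grades]

print(dense_rank([0.888, 0.854, 0.854, 0.812]))  # [1, 2, 2, 3]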
To order the 62 spam campaigns labeled in Section 5.4.2, we first extract for each campaign the four other ranking features explained in Section 5.5.1. Concerning the features U and L, we consider the set of interesting domains as {.ca, .com} and the set of desired languages as {English, French}, respectively. Considering an equal weight for each feature, i.e. ωi = 0.2 for 1 ≤ i ≤ 5, we calculate the grade of each campaign. The maximum number of elements among the 62 campaigns belongs to a portal campaign containing 407 spam emails; hence, the number of elements in each campaign is normalized by dividing it by 407.
Table 5.9 reports the properties of the first three ranked campaigns, where the grade of each campaign is calculated as follows :
grade(campaign1) = 0.2 · (1 + 0.96 + 0.98 + 1 + 0.5) = 0.888
grade(campaign2) = 0.2 · (0.78 + 0.86 + 0.97 + 1 + 0.66) = 0.854
grade(campaign3) = 0.2 · (0.15 + 0.91 + 1 + 1 + 1) = 0.812
The set of first-ranked campaigns reports the campaigns that should be analyzed and followed first by the investigators. The process is performed automatically; hence, vital information is provided in a short period of time, which would be almost impossible to achieve by considering a huge amount of spam emails as a whole.
5.6 Conclusion
Spam emails constitute a constant threat to both companies and private users. Not only are these emails unwanted, occupying storage space and requiring time to be deleted, they have also become vectors of security threats, used to perform cybercrimes such as phishing and malware distribution. In this chapter, we have presented a framework, named DWS, for the analysis of large amounts of spam emails collected through honeypots. We argue that DWS can provide a helpful tool for police and investigators in the forensic analysis of spam emails. In fact, DWS automatically clusters and classifies large amounts of spam emails into labeled campaigns, eventually helping the investigator to focus on the campaigns of a specific cybercrime, filtering out the non-interesting spam emails. Moreover, DWS is self-learning, not requiring any preexisting knowledge of the dataset to analyze. Instead, a small set of data, named small knowledge, is provided. To update the small knowledge, the investigators can add newly discovered templates to the previous set of small knowledge.
Preliminary tests performed on a first dataset of more than 3200 emails showed a good accuracy
of the DWS framework.
Furthermore, a ranking methodology has been proposed to order sets of spam campaigns based on the investigator’s priorities. The first-ranked campaigns are the ones which should be analyzed first.
Chapter 6
Algebraic Formalization of CCTree
Despite clustering being one of the most common approaches in unsupervised data analysis, very little literature exists on the formalization of clustering algorithms. In this chapter we propose a semiring-based methodology, named Feature-Cluster Algebra, which abstracts the representation of the labeled tree structure produced by a hierarchical categorical clustering algorithm, named CCTree ([127]). Through several theorems and examples we show that the abstract schema fully abstracts the tree structure. Full abstraction provides the interesting property that an algebraic term and a tree structure can be applied one instead of the other, when needed. This means that it is possible to use well-established concepts in the algebraic form of the clustering algorithm to get the equivalent result in the semantic form.
We apply the abstract schema of the CCTree to formalize CCTree parallelism with the use of a rewriting system. To this end, a set of functions and relations is defined on the feature-cluster algebra. Then, we first propose a rewriting system to automatically identify whether a term represents a CCTree term or not. Afterwards, a rewriting system is proposed to automatically obtain a final CCTree term from the addition of two (or more) CCTree terms. The final CCTree term is used to homogenize the structure of all the CCTrees on parallel devices.
6.1 Introduction
Clustering is a very well-known tool in unsupervised data analysis, which has been the focus of significant research in different domains of computer security, spanning from intrusion detection [145] and spam campaign detection, as explained in the previous chapters, to clustering Android malware [121]. The problem of clustering becomes more challenging when data are described in terms of categorical attributes, for which, differently from numerical attributes, it is hard to establish an ordering relationship [6]. The difficulty arises from the fact that the similarity of elements cannot be computed with well-known geometric distances, e.g. the Euclidean distance. In categorical clustering, each attribute contains a domain of discrete, mutually exclusive features, where each feature represents a value of an element. For example,
the attribute color may contain features such as red and blue.
Clustering algorithms are widely applied to real-world problems, including security problems; in the present thesis they have already been applied to spam campaign detection. Notwithstanding, very few works exist which express and solve the problems of clustering algorithms in terms of formal methods. Formal methods are mathematically based languages, techniques, and tools to specify general rules on a system, such that the desired properties of the system can easily be verified on the basis of the identified rules [37].
In the present work, we argue that using formal methods on the CCTree, as a specific form of categorical clustering algorithm, provides an abstract representation of clusters which facilitates the analysis of cluster properties, while avoiding the need to confront the large amount of data in each cluster. The proposed formal scheme is used to formalize a challenging task in categorical clustering algorithms, namely parallel clustering.
The CCTree (Categorical Clustering Tree) has a decision-tree-like structure, which iteratively divides the data of a node on the basis of the attribute, or domain of features, yielding the greatest entropy. The division of the data is shown with edges coming out from a parent node to its children, where the edges are labeled with the associated features. A node respecting the identified stop conditions is considered a leaf. The leaves of the tree are the desired clusters. Since the features, i.e. the edge labels, are notably significant in the construction of a CCTree, a CCTree has a feature-based structure.
Feature algebra [74] is a semiring-based formal method proposed to formalize feature-based product lines, e.g. software products. We import the idea of feature algebra to formalize the feature-based CCTree structure, and call our proposed semiring-based algebraic structure “Feature-Cluster Algebra”. The notion of feature-cluster algebra is used to abstract the CCTree representation as a term. The CCTree term is applied to formalize CCTree parallelism on the basis of a rewriting system. Parallel clustering is a methodology proposed to alleviate the problems of time and memory usage in clustering large datasets [42].
The contributions of this chapter can be summarized as follows :
— A semiring-based formal method, named Feature-Cluster Algebra, is proposed to abstract the representation of a categorical clustering algorithm, named CCTree ([127]). Abstraction theory is a delightful mathematical concept which constructs a brief sketch of the original representation of a problem in order to deal with it more easily. More precisely, abstraction is the process of mapping a representation of a problem, called the ground (semantic) representation, onto a new representation, called the abstract (syntax) representation, in such a way that it is possible to deal with the problem of the original space while preserving certain desirable properties, in a form that is simpler to handle, since it is constructed from the ground representation by removing unwanted detail [59].
— Through several theorems and examples we show that the proposed approach fully abstracts the CCTree representation under some conditions. Full abstraction is an interesting property of an abstract mapping, which guarantees that we can apply the ground (semantic) representation and the abstract (syntax) representation of a problem interchangeably.
— A rewriting system is proposed to automatically verify whether a term is a CCTree term or not. A rewriting system is a set of directed equations on a set of objects. Usually the objects of a rewriting system are called terms and the directed equations are called rewriting rules. The rewriting rules are applied to compute new terms by repeatedly replacing subterms of a given term until the simplest possible form is obtained. A rewriting system is an interesting mathematical concept which automatically creates a new desired final term by applying correctly specified rewriting rules [43].
— The abstract form of the CCTree is applied to formalize the process of parallelizing CCTree clustering on parallel computers with the use of a rewriting system. The proposed rewriting system contains a set of rewriting rules which direct us from a non-CCTree term to a CCTree term, representing a CCTree into which all the CCTrees on the parallel devices can be merged.
— We prove that the proposed rewriting systems are confluent. Termination and confluence are two interesting properties of a rewriting system. The termination of a rewriting system guarantees that the system does not contain a loop of rules, which would cause a non-terminating process of applying the rewriting rules. The confluence property of a rewriting system guarantees that applying the rewriting rules to a given term results in a unique term.
This chapter is organized as follows. In Section 6.2, we present a review of the literature on formalization methods applied to feature-based problems. In Section 6.3, the process of transforming a CCTree into its equivalent algebraic expression is explained in terms of semirings. In Section 6.4, we show that the proposed algebraic structure fully abstracts the tree representation. The relations on the feature-cluster algebra are introduced in Section 6.5. In Section 6.6, we apply the abstract CCTree representation to formalize CCTree parallel clustering in terms of a rewriting system. We conclude and point to future directions in Section 6.7.
6.2 Related work
Feature models are information models in which a set of products, e.g. software products or DVD player products, is represented as a hierarchical arrangement of features, with different relationships among the features [15]. Feature models are used in many applications as a result of being able to model complex systems, being interpretable, and being able to handle both ordered and unordered features [105]. The authors of [15] believe that designing a family of software systems in terms of features makes it easier to understand for all stakeholders, compared with other forms of representation. Representing feature models as a tree of features was first introduced by [82], to be used in software product lines. Some studies [31], [32] show that tree models combined with ensemble techniques lead to accurate performance in a variety of domains. In a feature model tree, differently from a CCTree, the root is the desired product, the nodes are the features, and different representations of the edges indicate the mandatory or optional presence of features.
[73], [74] were the first to apply an idempotent semiring as the basis for the formalization of tree models of products, calling it feature algebra. The concept of semiring is used to answer the needs of product families: abstract forms of expressions, refinements, multi-view reconciliations, product development, and classification. The elements of the semiring in the proposed methodology are sets of products, or product families.
To the best of our knowledge, we are the first to apply an algebraic structure to abstract the representation of a categorical clustering algorithm and to formalize the associated issues.
6.3 Feature-Cluster Algebra
In this section, we introduce our proposed semiring-based formal method, named feature-cluster algebra, to abstract the CCTree representation. To this end, we first explain what precisely a semiring is. Then, the process of transforming a tree structure into its equivalent term is presented.
6.3.1 Semiring
In abstract algebra, the term algebraic structure generally refers to a set of elements together with one or more finitary operations respecting specified properties [68]. In particular, a semiring is an algebraic structure with two binary operations on a set of elements. More precisely, a semiring is defined as follows.
Definition 6.1 (Semiring). A semiring is a set S with two binary operations “+” and “·”, called addition and multiplication respectively, such that (S, +) is a commutative monoid with identity element 0, and (S, ·) is a monoid with identity element 1. Multiplication distributes on the left and on the right over addition, and multiplication by 0 annihilates the elements of S. A semiring for which multiplication is commutative is called a commutative semiring [68].
More precisely, S equipped with two binary operations “+” and “·”, such that 0 and 1 are the identity elements of “+” and “·” respectively, is a semiring if for all a, b, c ∈ S the following laws are satisfied :
are satisfied :
(a + b) + c = a + (b + c)
0+a=a+0=a
a+b=b+a
(a · b) · c = a · (b · c)
1·a=a·1=a
a · (b + c) = (a · b) + (a · c)
(a + b) · c = (a · c) + (b · c)
0·a=a·0=0
Briefly, we write (S, +, ·, 0, 1) is a semiring.
A semiring (S, +, ·, 0, 1) is called an idempotent semiring, if for any a ∈ S, we have :
a+a=a
Semiring of Features
Let us consider a given set of disjoint sorts, denoted A, where the carrier set of each sort Ai ∈ A is denoted by VAi . In our context, we call the given set of sorts the set of attributes, and we call the union of the sorts, denoted V = ⋃Ai ∈A VAi , the set of values or features.
Example 6.1. We may consider the set of attributes as A = {color, size}, where the carrier
set of each attribute can be considered as Vcolor = {red, blue} and Vsize = {small, large}. In
this case, we have V = {red, blue, small, large}.
Definition 6.2 (Sort). We define the sort function, which takes a set of features and returns the set of sorts associated with the received features, as follows :
sort : P(V) → P(V)
sort({f }) = VA for f ∈ VA
sort(V1 ∪ V2 ) = sort(V1 ) ∪ sort(V2 )
Example 6.2. In the following, we present the application of sort function on a set of features
from Example 6.1 :
sort({red}) = {red, blue}
sort({red, small}) = sort({red}) ∪ sort({small}) = {red, blue, small, large}
Let F = P(P(V)) be the power set of the power set of V, where we denote 1 = {∅} and 0 = ∅. We define the operations “+” and “·” as choice and composition operators on F as follows :
· : F × F → F
F1 · F2 = {X ∪ Y : X ∈ F1 , Y ∈ F2 }
+ : F × F → F
F1 + F2 = F1 ∪ F2
We say that F belongs to the power set of features F if it respects one of the following syntax forms :
F := 0 | {{f }} | F · F | F + F | 1   (6.2)
where f ∈ V.
Example 6.3. In the following, some elements of F on V = {red, blue, small, large} are
presented :
F1 = {{red, large}, {blue}}
F2 = {{small}}
F1 · F2 = {{red, large, small}, {blue, small}}
F1 + F2 = {{red, large}, {blue}, {small}}
In the problem of formalizing categorical clustering, the set {{red, large}, {blue}} may represent two clusters, where the elements of the cluster {red, large} have the features red and large in common, and the elements of the cluster {blue} are all blue. This means that we use the addition to separate clusters, and we use the multiplication to consider more features in identifying the clusters. It is clear that not every combination of sets of features necessarily represents a clustering.
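To make the two operators concrete, here is a small Python illustration (not part of the formalism itself), representing elements of F as sets of frozensets of features:

# Elements of F as sets of feature sets; frozensets keep them hashable.
def mul(F1, F2):
    """Composition F1 · F2: pairwise union of the feature sets."""
    return {x | y for x in F1 for y in F2}

def add(F1, F2):
    """Choice F1 + F2: plain set union."""
    return F1 | F2

F1 = {frozenset({"red", "large"}), frozenset({"blue"})}
F2 = {frozenset({"small"})}
print(mul(F1, F2))  # {{red,large,small}, {blue,small}} as frozensets
print(add(F1, F2))  # {{red,large}, {blue}, {small}} as frozensets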
Proposition 6.4. It is easy to verify that the two operations “+” and “·” respect the following properties for every F1 , F2 , F3 ∈ F :
(F1 + F2 ) + F3 = F1 + (F2 + F3 )   (6.3)
F1 + F2 = F2 + F1   (6.4)
F1 · F2 = F2 · F1   (6.5)
(F1 · F2 ) · F3 = F1 · (F2 · F3 )   (6.6)
F1 · (F2 + F3 ) = (F1 · F2 ) + (F1 · F3 )   (6.7)
(F1 + F2 ) · F3 = (F1 · F3 ) + (F2 · F3 )   (6.8)
1 · F1 = F1 · 1 = F1   (6.9)
0 · F1 = F1 · 0 = 0   (6.10)
0 + F1 = F1 + 0 = F1   (6.11)
F1 + F1 = F1   (6.12)
Theorem 6.5. The quintuple (F, +, ·, 0, 1) constitutes an idempotent commutative semiring.
Proof. The proof is straightforward from Proposition 6.4.
Definition 6.3. Let |·| denote the number of elements in a set. We say that F ∈ F belongs to the set Fn if |F | = n. Under this definition, F1 , i.e. the subset of F whose elements each contain just one set of features, is the desired one according to our problem. In this case, for F ∈ F1 , we remove the brackets and separate the features belonging to the same set by multiplication. Hence, we consider F ∈ F1 if it can be written in one of the syntax forms 0 | f | F1 · F2 | 1, for f ∈ V.
It is noticeable that when two elements of F1 are added or multiplied, they follow the same properties as in the main semiring defined on F. In the following example, we show how this simpler representation is used in the rest of the chapter.
Example 6.6. We simplify the elements of Example 6.3 according to Definition 6.3, as the
following :
F1 = {{red, large}, {blue}} = {{red, large}} + {{blue}} = red · large + blue
F2 = {{small}} = small
F1 · F2 = {{red, large, small}, {blue, small}} = {{red, large, small}} + {{blue, small}}
= red · large · small + blue · small
F1 + F2 = {{red, large}, {blue}, {small}} = {{red, large}} + {{blue}} + {{small}}
= red · large + blue + small
The semiring of features can be used to represent different feature-based clustering algorithms. In our context, planning to address parallel clustering, we also need to discuss the different datasets the clusters originate from. To this end, in the upcoming subsection we present the semiring of elements.
Semiring of Elements
Let us consider that the set of sorts, or the set of attributes A, with an order among the attributes, is given. Suppose |A| = k and, without loss of generality, let A1 , A2 , . . . , Ak be the ordered sorts ranging over A. We say s belongs to the set of elements S if s ∈ VA1 × VA2 × . . . × VAk × N, where the carriers of the attributes are arbitrarily ordered (then fixed) for each problem, and N is the set of natural numbers. Hence, S ⊆ VA1 × VA2 × . . . × VAk × N, i.e. s ∈ S can be written as s = (x1 , x2 , · · · , xk , n), where xi ∈ VAi for 1 ≤ i ≤ k, and n ∈ N is a natural number representing the ID of an element. For the sake of simplicity, we may use the alternative representation xi ∈ Ai instead of xi ∈ VAi .
In our problem, S is the set of all elements to be clustered. As a result of having different sets of elements to be clustered in the problem of parallel clustering, we define a semiring on the power set of all elements. In this case, if we have, for example, two datasets of elements, say S1 and S2 , then S = S1 ∪ S2 .
Example 6.7. Consider that in Example 6.3 we have the Cartesian product of the carriers of the attributes as “color × size”; then the set of tuples S = {(red, small, 1), (blue, small, 2), (red, large, 3)} is a set of elements on V to be clustered in a specific problem.
We formally define the two operations “+” and “·” as union and intersection on the elements of P(S) (the power set of S), as follows :
· : P(S) × P(S) → P(S)
S1 · S2 = S1 ∩ S2
+ : P(S) × P(S) → P(S)
S1 + S2 = S1 ∪ S2
Formally, we say S belongs to the set of elements P(S) if it respects one of the following forms :
S := ∅ | S′ | S + S | S · S | S   (6.13)
where S′ ⊆ S.
Proposition 6.8. It is easy to verify that the operations “+” and “·” respect the following properties for every S1 , S2 , S3 ∈ P(S) :
(S1 + S2 ) + S3 = S1 + (S2 + S3 )   (6.14)
∅ + S1 = S1 + ∅ = S1   (6.15)
S1 + S2 = S2 + S1   (6.16)
(S1 · S2 ) · S3 = S1 · (S2 · S3 )   (6.17)
S · S1 = S1 · S = S1   (6.18)
S1 · (S2 + S3 ) = (S1 · S2 ) + (S1 · S3 )   (6.19)
(S1 + S2 ) · S3 = (S1 · S3 ) + (S2 · S3 )   (6.20)
∅ · S1 = S1 · ∅ = ∅   (6.21)
S1 + S1 = S1   (6.22)
S1 · S1 = S1   (6.23)
S1 · S2 = S1 if S1 ⊆ S2   (6.24)
Theorem 6.9. The quintuple (S, +, ·, ∅, S) is an idempotent commutative semiring.
Proof. The proof is straightforward from Proposition 6.8.
Note : It should be noted that the operations “+” and “·” are overloaded according to the kind of elements they are applied to. This means that if the operation “+” is used between two sets of elements, it refers to the addition operation of the semiring of elements, and when “+” is applied between two sets of features, it refers to the addition operation of the semiring of features. The same holds for the multiplication operation “·”.
Semiring of Terms
In the two previous subsections we introduced semirings on the set of features and on the set of elements, respectively. The reason underlying this choice is that in our context 1) categorical clusters are generally specified by a set of features, and 2) in formalizing parallel clustering we have several datasets and it is necessary to clearly specify which dataset of elements we refer to. In what follows, we construct the semiring of terms with the use of the previous semirings; it will be used to abstract the tree structure and to formalize parallel clustering. In the rest of the chapter, we use the same notions and symbols introduced above.
Recall that a cluster in a CCTree can be uniquely identified by a set of elements respecting a set of features. We define the satisfaction relation to formally express the concept of a cluster.
Definition 6.4 (Satisfaction Relation ⊨). Recall that when the elements of F contain just one dataset of features we remove the brackets (Definition 6.3). Hence, we define the satisfaction relation, denoted by ⊨, as follows:

⊨ : F × P(S) → P(S)
⊨(f, {(x1, x2, ..., xk, n)}) = {(x1, x2, ..., xk, n)}  if ∃i, 1 ≤ i ≤ k, s.t. xi = f
⊨(f, {(x1, x2, ..., xk, n)}) = ∅  if ∄i, 1 ≤ i ≤ k, s.t. xi = f
⊨(f, S1 ∪ S2) = ⊨(f, S1) ∪ ⊨(f, S2)
⊨(F1 · F2, S) = ⊨(F1, S) ∩ ⊨(F2, S)

and when ⊨(F, S) ≠ ∅, we say that S satisfies F. For the sake of simplicity, we use the alternative representation F ⊨ S instead of ⊨(F, S) when ⊨(F, S) ≠ ∅.
We consider that the multiplication "·" and the addition "+" over ⊨ respect the following properties:

(F1 ⊨ S1) · (F2 ⊨ S2) = (F1 · F2) ⊨ S2  if  S1 · S2 = S2    (6.25)
(F1 ⊨ S1) + (F2 ⊨ S2) = (F1 + F2) ⊨ S2  if  S1 + S2 = S2    (6.26)

where S1 · S2 = S2 means S2 ⊆ S1, and S1 + S2 = S2 means S1 ⊆ S2. In the case neither set is a subset of the other, the multiplication and addition return the received elements unchanged. It should be noted that "·" and "+" are overloaded to their own definitions for the semiring of features and the semiring of elements when they are applied between two sets of features and two sets of elements, respectively.
Roughly speaking, these properties can be interpreted as follows. The multiplication yields the tuples resulting from the intersection of two clusters whose element sets are such that one is a subset of the other, whilst the addition refers to the union of two clusters where one is a subset of the other. In our context, property 6.25 is applied to address the concept of dividing a cluster into smaller clusters: each new small cluster satisfies the features of the main cluster, plus more restrictive features. Moreover, property 6.26 is used to obtain the simpler form of clusters according to Definition 6.3.
Example 6.10. Let the set of elements S = {(red, small, 1), (blue, small, 2), (red, large, 3)} on the set of features V = {red, blue, small, large} be given. The following examples represent different clusters on this dataset in terms of the satisfaction relation:

⊨(red, {(red, small, 1)}) = {(red, small, 1)}
⊨(red, {(blue, small, 2)}) = ∅
⊨(red, {(red, small, 1), (blue, small, 2)}) = ⊨(red, {(red, small, 1)}) ∪ ⊨(red, {(blue, small, 2)}) = {(red, small, 1)} ∪ ∅ = {(red, small, 1)}
⊨(red · small, {(red, small, 1)}) = ⊨(red, {(red, small, 1)}) ∩ ⊨(small, {(red, small, 1)}) = {(red, small, 1)} ∩ {(red, small, 1)} = {(red, small, 1)}
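As an illustration of Definition 6.4, the following Python sketch (the function names sat and sat_product are ours) evaluates the satisfaction relation on small element sets, reproducing the computations of Example 6.10.

    # A sketch of the satisfaction relation of Definition 6.4 (names are ours):
    # sat(f, S) keeps the tuples one of whose coordinates equals the feature f,
    # and a product of features is evaluated as the intersection of the results.
    def sat(feature, elements):
        """|=(f, S): tuples (x1, ..., xk, n) with some xi equal to f (n is the ID)."""
        return {e for e in elements if feature in e[:-1]}

    def sat_product(features, elements):
        """|=(f1 . f2 ..., S) = intersection over the fi of sat(fi, S)."""
        result = set(elements)
        for f in features:
            result &= sat(f, elements)
        return result

    S = {("red", "small", 1), ("blue", "small", 2), ("red", "large", 3)}
    print(sat("red", {("blue", "small", 2)}))   # set(): the empty cluster
    print(sat_product(["red", "small"], S))     # {("red", "small", 1)}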
Proposition 6.11. For F1, F2 ∈ F and S ∈ P(S), the relation "⊨" satisfies the following properties with respect to "+" and "·":

(F1 · F2) ⊨ S = (F1 ⊨ S) · (F2 ⊨ S)    (6.27)
(F1 + F2) ⊨ S = (F1 ⊨ S) + (F2 ⊨ S)    (6.28)

Proof. The proof is straightforward from properties 6.25 and 6.26, since we have S · S = S and S + S = S.
Actually, equations 6.27 and 6.28 express how we can transform the different forms of F ∈ F to the form F ∈ F1.

Example 6.12. The following equation shows the transformation of 6.27 and 6.28 applied to a set of features F ∈ F1 as defined in Definition 6.3:

{{f1, f2}, {f3}} ⊨ S = {{f1, f2}} ⊨ S + {{f3}} ⊨ S = f1 · f2 ⊨ S + f3 ⊨ S
The form F ∈ F1 is a particular desired representation of the set of features which will be used in our context. Hence, we give it a specific name, as follows.

Definition 6.5 (Feature-Cluster (Family) Term). The set of feature-cluster family terms on V and S, denoted as FC_{V,S} (or simply FC if it is clear from the context), is the smallest set containing the elements satisfying the following conditions:

if S ⊆ S then S ∈ FC
if F ∈ F1 and S ⊆ S then F ⊨ S ∈ FC
if τ1 ∈ FC and τ2 ∈ FC then τ1 + τ2 ∈ FC

In this case, we call S and F ⊨ S a feature-cluster term, and the addition of one or more feature-cluster terms is called a feature-cluster family term. We may simply use FC-term to refer to a feature-cluster family term.
We define the block function, which receives a feature-cluster family term and returns the set of its blocks. Formally, we have:

block : FC → P(FC)
block(S) = {S}
block(F ⊨ S) = {F ⊨ S}
block(τ1 + τ2) = block(τ1) ∪ block(τ2)

In the case that no feature specifies S directly, the term is called an atomic term. The set of all atomic terms is denoted as A.
Example 6.13. In the following, some examples of FC-terms are presented:

S ∈ FC
red · small ⊨ S ∈ FC
red · small ⊨ S + blue ⊨ S ∈ FC

Example 6.14. Suppose that the term τ = red ⊨ S + blue ⊨ S is given. Applying the block function on τ results in:

block(red ⊨ S + blue ⊨ S) = {red ⊨ S, blue ⊨ S}
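The block function is easy to mirror in code. The sketch below (our own encoding, not the thesis') models an FC-term as a list of blocks, each a pair (set of features, dataset label), so that term comparison in the sense of Definition 6.6 below reduces to comparing block sets.

    # A sketch of FC-terms and the block function (encoding is ours): a term
    # tau1 + ... + taun is a list of blocks, each block a pair standing for
    # "f1 . f2 ... |= S"; block() returns the set of blocks.
    def block(term):
        """block(tau1 + tau2) = block(tau1) U block(tau2)."""
        return set(term)

    tau = [(frozenset({"red"}), "S"), (frozenset({"blue"}), "S")]   # red |= S + blue |= S
    tau_swapped = [(frozenset({"blue"}), "S"), (frozenset({"red"}), "S")]
    print(block(tau) == block(tau_swapped))  # True: addition order does not matter
    print(block(tau))                        # the two blocks, as in Example 6.14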
Definition 6.6 (FC-Term Comparison). We say two FC-terms τ1 and τ2 are equal, denoted by τ1 ≡ τ2, if, over the different representations of FC-terms, they satisfy the following relations:

S1 ≡ S2 ⇔ S1 = S2
F1 ⊨ S1 ≡ F2 ⊨ S2 ⇔ S1 = S2, F1 = F2
τ ≡ τ′ ⇔ block(τ) = block(τ′)

Example 6.15. The following examples show two simple equivalences of FC-terms:

red · small ⊨ S ≡ small · red ⊨ S
red · small ⊨ S + blue ⊨ S ≡ blue ⊨ S + small · red ⊨ S
Definition 6.7 (Term). We call τ a term if it has one of the following forms:

τ := S | F ⊨ S | τ + τ | τ · τ    (6.29)

where

S := ∅ | S′ | S + S | S · S | S    (6.30)
F := 0 | {{f}} | F + F | F · F | 1    (6.31)

in which 6.30 and 6.31 satisfy the properties specified in Section 6.3.1 for elements and features, respectively.

The set of terms on S and F is denoted as C_{S,F}, or abbreviated as C when it is known beforehand on which datasets it has been constructed.

As previously discussed, when an element of F contains just one dataset of features, we remove the brackets, and with the use of "·" we separate the features belonging to the associated dataset.
Example 6.16. In the following, some examples of terms on V = {red, blue, small, large} and dataset S are presented:

red · small ⊨ S
red · small ⊨ S + blue ⊨ S′
(red · small ⊨ S) · (blue ⊨ S′)
{{red, large}, {blue}} ⊨ S
Theorem 6.17. The two identity elements of C with respect to "+" and "·" are 0 ⊨ ∅ and 1 ⊨ S, respectively.

Proof. From properties 6.25 and 6.26, and the term definition (Definition 6.7), which considers the commutativity of multiplication and addition among terms, we have:

(1 ⊨ S) · (F ⊨ S) = (1 · F) ⊨ S = F ⊨ S    (6.32)
(0 ⊨ ∅) · (F ⊨ S) = (0 · F) ⊨ ∅ = 0 ⊨ ∅    (6.33)
(0 ⊨ ∅) + (F ⊨ S) = (0 + F) ⊨ S = F ⊨ S    (6.34)

For the other elements of C, the proof is straightforward from the above equations and properties 6.25 and 6.26.
Theorem 6.18. The quintuple (C, +, ·, 0 ⊨ ∅, 1 ⊨ S) is an idempotent commutative semiring.

Proof. The proof is straightforward from the semiring definition (Definition 6.1), the two semirings of Section 6.3.1, and the properties mentioned there.

Definition 6.8 (Feature-Cluster Algebra). The semiring (C, +, ·, 0 ⊨ ∅, 1 ⊨ S) is called a feature-cluster algebra.
It is noticeable that, in the following sections, our terms mostly belong to the set of feature-cluster family terms FC ⊆ C. This means that, as elements of the semiring (C, +, ·, 0 ⊨ ∅, 1 ⊨ S), they follow the same operations and properties as the elements of the proposed feature-cluster algebra.
6.4 Feature-Cluster (Family) Term Abstraction
In this section, we relate the concept of feature-cluster algebra to the tree structure. To this end, firstly, some preliminary notions related to graphs, abstraction, and rewriting systems are presented. Graph theory notions are used to formally represent the tree structure, whilst abstraction theory is used to prove that the syntactic form of trees (under some conditions) is equivalent to the semantic form of the tree structure.

This property is desirable in the sense that we are able to apply several interesting algebraic calculations on syntactic forms, whilst, whenever required, it is possible to transform them to the equivalent semantic structure, preserving the same properties as applying the calculations on semantic forms.

Moreover, the rewriting system is applied to automatically verify whether a term represents a CCTree or not, and to automatically obtain a homogenized CCTree term resulting from the addition of several CCTree terms.
6.4.1 Preliminary Notions
Graph Theory Preliminaries
In graph theory [62], a tree is an undirected graph in which any two vertices are connected by exactly one path, and a forest is a disjoint union of trees. A tree is called a rooted tree if one vertex has been designated the root, which means that the edges have a natural orientation, towards or away from the root [62]. A node directly connected to another node when moving away from the root is a child node. In a rooted tree, every node except the root has one parent node, called its predecessor; a child node is likewise called a successor. A node without successors in a rooted tree is called a leaf. A tree is a labeled tree if the edges of the tree are labeled. A branch of a tree refers to the path between the root and a leaf in a rooted tree [62]. The descendant tree of an edge f in a rooted tree T is the subtree of T following the edge f.
Definition 6.9 (Graph Homomorphism, Graph Isomorphism). A graph homomorphism from a graph G = (V, E) to a graph G′ = (V′, E′), written as ζ : G → G′, is a mapping ζ : V → V′ from the vertex set of G to the vertex set of G′ such that {u, v} ∈ E implies {ζ(u), ζ(v)} ∈ E′ [70]. If the homomorphism ζ : G → G′ is a bijection whose inverse function is also a graph homomorphism, then ζ is a graph isomorphism. In our context it is important that both {u, v} ∈ E and {ζ(u), ζ(v)} ∈ E′ have the same edge label. Under this condition, we say that two graphs G = (V, E) and G′ = (V′, E′) are equivalent, denoted as G ≈ G′, if V = V′, E = E′, for {u, v} ∈ E and {ζ(u), ζ(v)} ∈ E′ we have {u, v} = {ζ(u), ζ(v)}, and finally G and G′ are isomorphic.
Definition 6.10 (Tree Structure). In our context, a graph structure is a triple (F, Q, ω) where F represents the set of edge labels, Q is the set of states or nodes, and ω is the transition function, ω : Q × F → Q. A graph structure is a tree structure if there is no cycle in the transitions. In this case, the transitions are written such that each parent node is connected to its children moving from the root.

We write a transition ω(s1, f) = s2 as a triple (s1, f, s2). Hence, the set of transitions in our context is a set of triples, where the first component is a parent node (predecessor), the last component is a child (successor) of the first component, and the middle component is the edge label (feature) leading from this parent node to its child.
Note: It is worth noticing that a CCTree is a tree structure, which in our context can be formally presented as a triple where the first component (F) contains the set of edge labels, the second component (Q) contains the nodes of the CCTree, and the last component is the set of transitions from a parent node through edge labels to its children. We label the root node with the main dataset desired to be clustered.
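Under the stated definition, a tree structure is just a triple. The short Python sketch below (our own encoding, with made-up node labels) represents a small CCTree as (F, Q, ω), where ω is stored as a dictionary keyed on (parent, feature) pairs.

    # A sketch of the triple (F, Q, omega) of Definition 6.10 for a small CCTree
    # (labels are ours): the root S is divided on color, and Sb further on size.
    F = {"red", "blue", "small", "large"}                    # edge labels (features)
    Q = {"S", "Sr", "Sb", "Sbs", "Sbl"}                      # states (nodes)
    omega = {("S", "red"): "Sr", ("S", "blue"): "Sb",        # transition function
             ("Sb", "small"): "Sbs", ("Sb", "large"): "Sbl"}

    def children(node):
        """All (feature, child) transitions leaving `node`, moving away from the root."""
        return {f: q for (p, f), q in omega.items() if p == node}

    print(children("S"))    # {'red': 'Sr', 'blue': 'Sb'}
    print(children("Sbs"))  # {}: a leaf has no successors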
Abstraction Theory Preliminaries
What does abstraction mean in general? Some of the synonyms of the word "abstract" are "brief", "synopsis" and "sketch"; some of the synonyms of the verb "to abstract" are "to detach" and "to separate". The intuition which comes out of this list of synonyms is that the process of abstraction is related to the process of separating, extracting from a representation of an object or subject an "abstract" representation that consists of a brief sketch of the original representation [59].

More precisely, abstraction is the process of mapping a representation of a problem, called the "ground" representation, onto a new representation, called the "abstract" representation, such that it helps to deal with the problem in the original search space by preserving certain desirable properties, and is simpler to handle, as it is constructed from the ground representation by "not considering the details" [59]. The most common use of abstraction is in theorem proving, which abstracts the goal, proves its abstracted version, and then uses the structure of the resulting proof to help construct the proof of the original goal. This is based on the assumption that the structure of the abstract proof is equivalent to the structure of the proof of the goal. The other main use of abstraction theory has been to study the formal properties of abstractions and the operations, like composition and ordering, which can be defined upon them [59].
An abstraction can formally be written as a function [[.]] : X → Y from the ground representation (semantic form) of an object to its abstract form (syntactic form). If the equivalence of elements in X is denoted by ≃ and the equivalence of elements in Y is represented by ≅, we say [[.]] adequately abstracts X if, from the equivalence of two abstract forms, we get the equivalence of their corresponding ground forms:

[[X1]] ≅ [[X2]] ⇒ X1 ≃ X2    (6.35)

We say [[.]] abstracts X if we have:

X1 ≃ X2 ⇒ [[X1]] ≅ [[X2]]    (6.36)

When 6.35 and 6.36 are both satisfied, we say [[.]] fully abstracts X, i.e. we have:

[[X1]] ≅ [[X2]] ⇔ X1 ≃ X2
Rewriting System Terminology
A rewriting system is specified by a set of directed equations on a set of objects. Usually the objects in a rewriting system are called terms and the directed equations are called rewriting rules. The rewriting rules are applied to compute new terms by repeatedly replacing subterms of a given term until the simplest possible form is obtained [43].

More precisely, a rewriting rule is an ordered pair of terms x and y, written as x → y. Like equations, rules are applied to replace instances of x by corresponding instances of y; unlike equations, rules are not applied to replace instances of the right-hand side y [43]. A term over symbols G, constants K, and variables X is either a variable x ∈ X, a constant k ∈ K, or an expression of the form g(t1, t2, ..., tn), where g ∈ G is a function symbol of n arguments and the ti are terms [43]. A derivation for a rule → is a sequence of the form t0 → t1 → .... The term t is reducible with respect to the rule → if there is a term u such that t → u; otherwise it is considered irreducible. A rewrite system R is a set of rewrite rules x → y, where x and y are terms. The term u is a →-normal form of t if t →* u and u is irreducible via →, where →* means that the rule → is applied finitely many times. A relation → is terminating if there are no infinite derivations t0 → t1 → ..., i.e. every derivation eventually reaches a normal form. A relation → is confluent if, whenever u →* s and u →* t for some elements s, t and u, there is an element v such that s →* v and t →* v. A relation → is convergent if it is terminating and confluent. Convergent rewriting systems are interesting because all derivations lead to a unique normal form [43].

A conditional rule is an equational implication in which the term in the conclusion is reached just if the conditions are satisfied. We use the form x1 = u1 ∧ ... ∧ xn = un | x → y to express that under the conditions x1 = u1 ∧ ... ∧ xn = un we have x → y.
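The following generic Python sketch (the helper is ours, not from the thesis) illustrates this terminology: it repeatedly applies a rewrite rule until an irreducible term, i.e. a normal form, is reached. The toy rule is terminating and confluent, so every input reaches a unique normal form.

    # A generic sketch of rewriting to a normal form: `step` rewrites a term
    # once, or returns None when the term is irreducible.
    def normal_form(term, step):
        while (rewritten := step(term)) is not None:
            term = rewritten
        return term

    # Toy rule "ab -> ba": terminating and confluent, so each string has a
    # unique normal form with all a's moved to the right.
    rule = lambda s: s.replace("ab", "ba", 1) if "ab" in s else None
    print(normal_form("abab", rule))  # bbaa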
6.4.2 Graph Structure and Feature-Cluster Family Terms
In this subsection, we explain how a graph structure and a feature-cluster family term can be transformed into each other. To this end, we first present the "meaning" relation, which transforms a feature-cluster family term into a labeled graph structure. Afterwards, we present a function to obtain a feature-cluster family term from a labeled tree structure. We then prove that if two labeled trees are equivalent, they yield equal terms. However, we show that two equal feature-cluster family terms do not necessarily yield two equivalent graph structures; we prove that under the condition of a fixed order among the features, this latter requirement is also respected.

In the provided examples, the attributes Color = {r(ed), b(lue)}, Size = {s(mall), l(arge)}, and Shape = {c(ircle), t(riangle)} are used to describe the terms.
To avoid confusion among the different representations of an FC-term, in what follows we present the definitions of factorized and non factorized terms.

Definition 6.11 (Factorized Term). We define the factorization rewriting rule through an attribute A ∈ A, denoted as →A, from an FC-term to its factorized form as follows:

f · τ1 + f · τ2 →A f · (τ1 + τ2)  for  f ∈ A

We denote the normal form of applying the factorization rewriting rule on a term τ through attribute A as τ↓A, and the set of factorized forms of the terms of FC is denoted by FC↓. A term after factorization is called a factorized term.
Definition 6.12 (Defactorization). We define the defactorization rewriting rule on an FC-term as follows:

f · (τ1 + τ2) →d f · τ1 + f · τ2

A normal term resulting from the defactorization rewriting rule is called a non factorized term. The non factorized form of a term τ is denoted as τ↑, and the set of non factorized terms of FC is denoted by FC↑.
Example 6.19. In what follows, we show how factorization and defactorization perform. For factorization we have:

(r · s ⊨ S + r · c ⊨ S + b · s ⊨ S) ↓Color = r · (s ⊨ S + c ⊨ S) + b · s ⊨ S

and for defactorization:

r · (s ⊨ S + c ⊨ S) + b · s ⊨ S →d r · s ⊨ S + r · c ⊨ S + b · s ⊨ S
From Feature-Cluster Family Term to Tree Structure
Applying the same notions presented in the previous sections, in what follows we define three functions, which return the set of edge labels, the set of nodes, and the set of transitions of a received FC-term, respectively. These three functions are used in our context to obtain a forest structure from an FC-term.

We define the feature function, denoted by Θ, which gets a non factorized FC-term and returns a set of features, as follows:

Θ : FC↑ → P(V)
Θ(S) = ∅
Θ(f ⊨ S) = {f}
Θ(f · F ⊨ S) = {f} ∪ Θ(F ⊨ S)
Θ(τ1 + τ2) = Θ(τ1) ∪ Θ(τ2)
We define the state function, denoted by Φ, which gets a non factorized FC-term and returns a set of FC-terms, as follows:

Φ : FC↑ → P(FC↑)
Φ(S) = {S}
Φ(f ⊨ S) = {f ⊨ S, S}
Φ(f · F ⊨ S) = {f · F ⊨ S} ∪ Φ(F ⊨ S)
Φ(τ1 + τ2) = Φ(τ1) ∪ Φ(τ2)
Moreover, we define the transition function, denoted by Ω, which gets a non factorized FC-term and returns the set of transitions from the associated nodes, as follows:

Ω : FC↑ → P(FC↑ × V × FC↑)
Ω(S) = ∅
Ω(f ⊨ S) = {(S, f, f ⊨ S)}
Ω(f · F ⊨ S) = {(F ⊨ S, f, f · F ⊨ S)} ∪ Ω(F ⊨ S)
Ω(τ1 + τ2) = Ω(τ1) ∪ Ω(τ2)
Now we are ready to introduce the meaning relation, which gets a non factorized FC-term and returns a forest structure.

Definition 6.13. The meaning relation, denoted as [[.]], gets a non factorized FC-term and returns a triple representing a forest (or tree) structure, as follows:

[[.]] : FC↑ → G_{V,FC}
[[τ]] = (Θ(τ), Φ(τ), Ω(τ))

where G_{V,FC} is the set of all possible forest structures on the set of edge labels V and the set of node labels FC.
Example 6.20. In what follows, we show how a feature-cluster family term is transformed to its equivalent graph structure according to the above rules:

[[r ⊨ S + b · l ⊨ S + b · s ⊨ S]] =
({b, r, l, s},
{S, r ⊨ S, b ⊨ S, b · l ⊨ S, b · s ⊨ S},
{(S, r, r ⊨ S), (S, b, b ⊨ S), (b ⊨ S, l, b · l ⊨ S), (b ⊨ S, s, b · s ⊨ S)})
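Under the block encoding used in the earlier sketches, the three functions are straightforward to implement. The Python sketch below (our own encoding: each block is a root-to-leaf list of features plus a dataset label, and a node label like "b.l" abbreviates the FC-term b · l ⊨ S) reproduces Example 6.20.

    # A sketch of Theta, Phi and Omega on a non factorized FC-term given as a
    # list of (path, dataset) blocks, with `path` listing the block's features
    # from the root towards the leaf.
    def meaning(term):
        labels, nodes, transitions = set(), set(), set()
        for path, dataset in term:
            nodes.add(dataset)                             # the root node S
            for i, f in enumerate(path):
                labels.add(f)                              # Theta: collect edge labels
                parent = ".".join(path[:i]) or dataset     # Phi: prefixes are the nodes
                child = ".".join(path[:i + 1])
                nodes.add(child)
                transitions.add((parent, f, child))        # Omega: parent --f--> child
        return labels, nodes, transitions

    term = [(["r"], "S"), (["b", "l"], "S"), (["b", "s"], "S")]  # r |= S + b.l |= S + b.s |= S
    labels, nodes, transitions = meaning(term)
    print(nodes)        # {'S', 'r', 'b', 'b.l', 'b.s'}, as in Example 6.20
    print(transitions)  # {('S','r','r'), ('S','b','b'), ('b','l','b.l'), ('b','s','b.s')}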
From Tree Structure to a Feature-Cluster Family Term
We define the root function, denoted as r, which gets a tree and returns the root of the tree. Formally:

r : T_{V,FC} → Q
r(T) = s  such that there is no transition (si, f, s) ∈ ω

where T_{V,FC} is the set of rooted trees on V and FC.

We define the set of edge labels of the children of r(T) as follows:

δ(T) = {f | ∃ s′ ∈ Q s.t. (r(T), f, s′) ∈ ω}

Moreover, in a tree T, the descendant tree directly after an edge f, i.e. the derivative tree of T following edge f, is denoted by ∂f(T). We define the Ψ function, which gets a tree structure T and returns its features, as follows:

Ψ(T) = Σ_{f ∈ δ(T)} f · Ψ(∂f(T))    (6.37)

where Ψ(T) = 1 when δ(T) = ∅. We write f · 1 as f.
We define the transform function, denoted by ψ, which gets a set of labeled trees (a forest) and returns an FC-term, as follows:

ψ : G_{V,FC} → FC
ψ(∅) = 0
ψ(T1 ∪ T2) = Ψ(T1) ⊨ r(T1) + Ψ(T2) ⊨ r(T2)
Example 6.21. Suppose the following tree is given:

M = ({f1, f2}, {s, s1, s2}, {(s, f1, s1), (s, f2, s2)})

Then the only state to which there is no transition is the node s. Hence, we have:

Ψ(M) = f1 · Ψ((∅, {s1}, ∅)) + f2 · Ψ((∅, {s2}, ∅)) = f1 · 1 + f2 · 1 = f1 + f2

and the resulting term is equal to:

ψ(M) = Ψ(M) ⊨ s
Definition 6.14. A term resulting from a CCTree structure, or equivalently transformable to a tree structure representing a CCTree, is called a CCTree term.

Example 6.22. Suppose the CCTree of Figure 6.1 is given. The tree structure of this CCTree can be written as follows:

({red, blue, small, large}, {S, Sr, Sb, Sb·s, Sb·l},
{(S, red, Sr), (S, blue, Sb), (Sb, small, Sb·s), (Sb, large, Sb·l)})

The CCTree term resulting from this CCTree equals:

red ⊨ S + blue · small ⊨ S + blue · large ⊨ S
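The reverse direction can be sketched the same way. The Python function below (ours, using the dictionary encoding of transitions from the earlier tree sketch) implements equation 6.37: it recursively multiplies each edge label by the term of its descendant tree and sums over the children, recovering the CCTree term of Example 6.22 in factorized form.

    # A sketch of Psi (equation 6.37); the string output and "|=" spelling are ours.
    def psi(omega, node):
        """Psi of the subtree rooted at `node`; returns "1" for a leaf."""
        kids = sorted((f, q) for (p, f), q in omega.items() if p == node)
        if not kids:
            return "1"
        parts = []
        for f, q in kids:
            sub = psi(omega, q)
            parts.append(f if sub == "1" else f + ".(" + sub + ")")  # f . 1 is written f
        return " + ".join(parts)

    omega = {("S", "red"): "Sr", ("S", "blue"): "Sb",
             ("Sb", "small"): "Sbs", ("Sb", "large"): "Sbl"}  # the CCTree of Figure 6.1
    print(psi(omega, "S") + " |= S")
    # blue.(large + small) + red |= S, whose defactorized form is
    # red |= S + blue.small |= S + blue.large |= S, as in Example 6.22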
Proposition 6.23. For each non factorized FC-term τ, there exists at least one forest structure in G_{V,FC} that represents τ. Moreover, for each labeled forest structure T in G_{V,FC}, there exists a unique term that represents T.

Proof. The proof is straightforward from the proposed methodology of transforming a forest structure to a term and vice versa.
Figure 6.1 – A Small CCTree: the root S is divided on red and blue into Sr and Sb, and Sb is further divided on small and large into Sb·s and Sb·l.
Theorem 6.24. The meaning relation [[.]] adequately abstracts the graph structures resulting from feature-cluster (family) terms on V and the same fixed dataset of elements S ⊆ S. This means that for two non factorized FC-terms τ and τ′ we have:

[[τ]] ≈ [[τ′]] ⇒ τ ≡ τ′    (6.38)

Intuitively, relation 6.38 expresses that if the two forest structures resulting from two FC-terms are equivalent, we can conclude with certainty that the original terms were equal as well. In other words, if τ ≢ τ′ then we can conclude that [[τ]] ≉ [[τ′]].

Proof. From Proposition 6.23, each labeled forest structure is represented by a unique term; this means that [[τ]] and [[τ′]] each certainly correspond to a term. Now, suppose that the left-hand side of 6.38 is satisfied. Hence, we have:

[[τ]] ≈ [[τ′]] ⇒ Θ(τ) = Θ(τ′), Φ(τ) = Φ(τ′), Ω(τ) = Ω(τ′)    (6.39)
⇒ block(τ) = block(τ′) ⇒ τ ≡ τ′    (6.40)

where 6.39 results from the equivalent graph structures of τ and τ′, and 6.40 is satisfied from Φ(τ) = Φ(τ′) and the fact that the two original terms originated from the same dataset.
The following example shows that relation 6.38 is not satisfiable from right to left.

Example 6.25. The two following feature-cluster family terms are equivalent in terms of the term comparison of Definition 6.6, i.e. we have:

f1 · f2 ⊨ S + f1 · f3 ⊨ S ≡ f2 · f1 ⊨ S + f3 · f1 ⊨ S

but their tree representations are not equivalent, since we have:

[[f1 · f2 ⊨ S + f1 · f3 ⊨ S]] = ({f1, f2, f3}, {S, f1 ⊨ S, f1 · f2 ⊨ S, f1 · f3 ⊨ S},
{(S, f1, f1 ⊨ S), (f1 ⊨ S, f2, f1 · f2 ⊨ S), (f1 ⊨ S, f3, f1 · f3 ⊨ S)})

[[f2 · f1 ⊨ S + f3 · f1 ⊨ S]] = ({f1, f2, f3}, {S, f2 ⊨ S, f3 ⊨ S, f2 · f1 ⊨ S, f3 · f1 ⊨ S},
{(S, f2, f2 ⊨ S), (f2 ⊨ S, f1, f2 · f1 ⊨ S), (S, f3, f3 ⊨ S), (f3 ⊨ S, f1, f3 · f1 ⊨ S)})

where the first one contains four nodes, whilst the second one contains five nodes. This means that they are not isomorphic graphs.
This example shows that the commutativity of "·" is incompatible with full abstraction. In what follows, we show that the reverse of 6.38 is satisfied if an order is identified on the set of features, which solves the problem of multiplication ("·") commutativity.

Definition 6.15 (Ordered Features). We say that the set of features V is an ordered set of features if there is an order relation "<" on V such that (V, <) is a totally ordered set. This means that for any f1, f2 ∈ V we either have f1 < f2 or f2 < f1. We say F1 is exactly equal to F2, denoted by F1 ≅ F2, if they are equal considering the order of features.
Definition 6.16 (Order Rewriting Rule). Let an ordered set of features (V, <) be given. We say an FC-term is an ordered FC-term on (V, <) if it is the normal form of applying the following rewriting rule:

f1 · f2 ⊨ S →O f2 · f1 ⊨ S  if  f1 < f2,  ∀ f1, f2 ∈ V

Moreover, we define a rewriting rule which orders the features of an FC-term based on an attribute A ∈ A as follows:

f2 · f1 ⊨ S →O,A f1 · f2 ⊨ S  if  f1 ∈ A

We represent the normal form of a term τ after applying the above rewriting rule based on attribute A as τ⇓A.
Example 6.26. Suppose that the set of features V1 = {red, blue, small, large} is given. Without loss of generality, fixing the strict order "red < blue < small < large" makes (V1, <) a totally ordered set. The following examples show how ordered FC-terms on V1 are obtained by applying the order rewriting rule:

red · small ⊨ S →O small · red ⊨ S
red · small ⊨ S + blue · large ⊨ S →O small · red ⊨ S + large · blue ⊨ S

Moreover, red · small ≇ small · red, whilst red · small ≅ red · small.
Definition 6.17 (Ordered FC-term Comparison). We say two ordered FC-terms on (V, <) are exactly equal, denoted by ∼, the smallest relation for which the terms respect one of the following relations:

1. if S1 = S2 then S1 ∼ S2
2. if S1 = S2 ∧ F1 ≅ F2 then F1 ⊨ S1 ∼ F2 ⊨ S2
3. if ∀τi ∈ block(τ) ∃τj ∈ block(τ′) s.t. τi ∼ τj and ∀τj ∈ block(τ′) ∃τi ∈ block(τ) s.t. τj ∼ τi, then τ ∼ τ′
Example 6.27. Let the ordered set of features of Example 6.26 be given. The following examples show how two ordered FC-terms are compared:

red · small ⊨ S ≁ small · red ⊨ S
red · small ⊨ S ∼ red · small ⊨ S
red · small ⊨ S + blue ⊨ S ∼ blue ⊨ S + red · small ⊨ S
Theorem 6.28. Let (V, <) be a totally ordered set of features and S ⊆ S. The meaning relation [[.]] abstracts the forest (tree) structures resulting from the ordered non factorized FC-terms on V and S. This means that, considering τ and τ′ to be two arbitrary ordered non factorized FC-terms on (V, <) and S ⊆ S, we have:

τ ∼ τ′ ⇒ [[τ]] ≈ [[τ′]]    (6.41)

Proof. Suppose the left-hand side of 6.41 is satisfied. This means that for each feature-cluster term τi ∈ block(τ) there exists a feature-cluster term τj ∈ block(τ′) such that τi and τj are exactly equal. This property ensures that the set of transitions of [[τi]] is equal to the set of transitions of [[τj]], and consequently 6.41 follows. More precisely, we have:

τ ∼ τ′ ⇒ ∀τi ∈ block(τ) ∃τj ∈ block(τ′) s.t. τi ∼ τj (⇒ [[τi]] ≈ [[τj]]),    (6.42)
∀τj ∈ block(τ′) ∃τi ∈ block(τ) s.t. τj ∼ τi (⇒ [[τi]] ≈ [[τj]])
⇒ [[τ]] ≈ [[τ′]]    (6.43)
Now we are ready to present the main theorem of this section, which provides the conditions for full abstraction.

Theorem 6.29 (Main Theorem). Let the ordered set of features (V, <) and the set of elements S ⊆ S be given. The meaning function [[.]] fully abstracts the ordered feature-cluster family terms on (V, <) and S. This means that for two arbitrary ordered feature-cluster family terms τ and τ′ on V and S, we have:

[[τ]] ≈ [[τ′]] ⇔ τ ∼ τ′    (6.44)

Proof. The proof is straightforward from the proofs of Theorems 6.24 and 6.28.
6.5 Relations on Feature-Cluster Algebra
In this section, we define several relations on the feature-cluster algebra and discuss their properties. Here, we use the same notions and symbols introduced in Section 6.3.1.
Definition 6.18 (Attribute Division). Attribute division (D_A) is a function from A × FC↑ to {True, False}, which gets an attribute and a non factorized FC-term as input; it returns True or False as follows:

D_A : A × FC↑ → {True, False}
D_A(A, S) = False
D_A(A, f ⊨ S) = True  if  f ∈ A
D_A(A, f ⊨ S) = False  if  f ∉ A
D_A(A, f · F ⊨ S) = D_A(A, f ⊨ S) ∨ D_A(A, F ⊨ S)
D_A(A, τ1 + τ2) = D_A(A, τ1) ∧ D_A(A, τ2)
The concept of attribute division is used to order the attributes present in a term, as will be discussed later.
Example 6.30. In the following, we show how attribute division performs:

D_A(Color, r · s ⊨ S + r · c ⊨ S + b · s ⊨ S)
= D_A(Color, r · s ⊨ S) ∧ D_A(Color, r · c ⊨ S) ∧ D_A(Color, b · s ⊨ S) = True
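Attribute division amounts to a one-line check per block. The sketch below (our encoding, with blocks as feature lists) evaluates D_A on the term of Example 6.30.

    # A sketch of attribute division D_A (Definition 6.18): True iff every block
    # of the term contains at least one feature of the attribute.
    def divides(attribute, blocks):
        return all(any(f in attribute for f in path) for path in blocks)

    color, size, shape = {"r", "b"}, {"s", "l"}, {"c", "t"}
    tau = [["r", "s"], ["r", "c"], ["b", "s"]]   # r.s |= S + r.c |= S + b.s |= S
    print(divides(color, tau))  # True, as in Example 6.30
    print(divides(size, tau))   # False: the block r.c |= S has no size feature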
Definition 6.19 (Initial). We define the initial function δ from P(FC↑) to P(F), which gets a set of ordered non factorized terms on (V, <) and returns the set of the first features of each term, as follows:

δ : P(FC↑) → P(F)
δ(∅) = {0}
δ({S}) = {1}
δ({f · F ⊨ S}) = {f}
δ({τ1 + τ2}) = δ({τ1}) ∪ δ({τ2})
δ({τ1, τ2}) = δ({τ1}) ∪ δ({τ2})

with the following property:

δ({X, Y}) = δ(X) ∪ δ(Y)

where X, Y ∈ P(FC↑).

In the case that the input set contains just one term, we remove the brackets, i.e. δ({τ}) = δ(τ) when |{τ}| = 1. Moreover, when the output set contains just one element, for the sake of simplicity we also remove the brackets, i.e. δ(X) = {f} = f for X ∈ P(FC↑).
Example 6.31. In the following, we show the result of the initial function on a pair of terms:

δ({S, r · s ⊨ S}) = {1, r}
Definition 6.20 (Derivative). The Brzozowski derivative [23] of a set S of strings with respect to a string u, denoted as u⁻¹S, is defined as the set of all suffixes obtainable from a string in S by cutting off its prefix u. In our context, importing Brzozowski's idea, we define the derivative, denoted by ∂, as a function which gets an ordered non factorized FC-term on (V, <) and returns the term (set of terms) obtained by cutting off the first features, as follows:

∂ : FC↑ → P(FC)
∂(S) = ∅
∂(f ⊨ S) = {S}
∂(f · F ⊨ S) = {F ⊨ S}
∂(τ1 + τ2) = ∂(τ1) ∪ ∂(τ2)
Note: The functions initial (δ) and derivative (∂) are overloaded depending on whether the input is a tree or a term.
Definition 6.21 (Order of Attributes). We say attribute B is smaller than or equal to attribute A on the non factorized term τ ∈ FC↑, denoted as B ⪯τ A, if the number of blocks of τ that B divides is less than or equal to the number of blocks that A divides. Formally, B ⪯τ A implies that:

|{τi ∈ block(τ) | D_A(B, τi) = True}| ≤ |{τi ∈ block(τ) | D_A(A, τi) = True}|

Given a set of attributes A and a term τ, the set (A, ⪯τ) is a lattice. We denote the upper bound of this set as u_{A,τ}. This means that we have: ∀ A ∈ A ⇒ A ⪯τ u_{A,τ}.
Example 6.32. In the following, we show how the order of attributes of a term is identified. Suppose the term τ = r · s ⊨ S + r · c ⊨ S + b · s ⊨ S is given. We have:

block(τ) = {r · s ⊨ S, r · c ⊨ S, b · s ⊨ S}

consequently,

|{τi ∈ block(τ) | D_A(Shape, τi) = True}| = 1
≤ |{τi ∈ block(τ) | D_A(Size, τi) = True}| = 2
≤ |{τi ∈ block(τ) | D_A(Color, τi) = True}| = 3
which means that we have:

Shape ⪯τ Size ⪯τ Color

Recall that not having a predefined order among the features creates a problem for the full abstraction of terms. To this end, here we propose a way to order the set of features which is appropriate to our problem.

First of all, given a feature-cluster family term τ, we find the order of the attributes according to Definition 6.21; if two arbitrary attributes A and A′ are equal under this order, without loss of generality we choose a strict order among them, say A ≺ A′. Then within each attribute we arbitrarily order the features, with the constraint that the features of a smaller attribute are always smaller than the features of a greater attribute. For example, if Size ≺ Color, we consider the order of features as small < large < blue < red, where all the features of Color are greater than all the features of Size.
Definition 6.22 (Ordered Unification). Ordered unification (F) is a partial function from P(A) × FC↑ to FC↓, which gets a set of attributes and a non factorized term; it returns the normal form of applying the rewriting rule →O,A introduced in Definition 6.16, iteratively, based on the order of the attributes on the received term, as follows:

F : P(A) × FC↑ → FC↓
F(∅, τ↑) = τ
F({A}, τ↑) = τ⇓A
F(A, τ) = F({u_{A,τ}}, F(A − {u_{A,τ}}, τ↑))

The normal form of ordered unification is called a unified term. By F*(τ) we mean that F is performed iteratively on the set of ordered attributes on τ to obtain the unified term.
Example 6.33. To find the unified form of τ1 = r · s ⊨ S + r · c ⊨ S + b · s ⊨ S, we have:

F*(τ1) = F({Shape, Color, Size}, τ1↑)
= F(Color, F(Size, F(Shape, τ1))) = r · s ⊨ S + r · c ⊨ S + b · s ⊨ S
Definition 6.23 (Component Relation). Given two ordered non factorized FC-terms τ1 and τ2 on (V, <), we define the component relation, denoted by ∼1, as the first-level comparison of terms:

τ1 ∼1 τ2 ⇔ δ(τ1) = δ(τ2)

Proposition 6.34. The component relation is an equivalence relation on the set of ordered non factorized FC-terms.
Proof. For ordered non factorized FC-terms τ1, τ2 and τ3, we have:

τ1 ∼1 τ1, since δ(τ1) = δ(τ1);
if τ1 ∼1 τ2 then τ2 ∼1 τ1, since δ(τ1) = δ(τ2) implies δ(τ2) = δ(τ1);
if τ1 ∼1 τ2 and τ2 ∼1 τ3 then τ1 ∼1 τ3, since δ(τ1) = δ(τ2) and δ(τ2) = δ(τ3) imply δ(τ1) = δ(τ3).
Definition 6.24 (Component). Let the ordered term τ ∈ FC↑ on (V, <) be given. The equivalence class of τ′ ∈ block(τ) is called a component of τ, and is formally defined as:

[τ′]τ = {τi ∈ block(τ) | τ′ ∼1 τi}

The set of all components of the term τ through the equivalence relation ∼1 is denoted by block(τ)/∼1, or simply τ/∼1, i.e. we have:

τ/∼1 = {[τi]τ | τi ∈ block(τ)}
Definition 6.25 (Component Order). Let X and Y be two sets of ordered non factorized FC-terms on (V, <). We say X is smaller than Y, denoted as X < Y, if:

X < Y ⇔ ∀f′ ∈ δ(X), ∀f″ ∈ δ(Y) : f′ < f″

Specifically, let τ be an ordered non factorized FC-term on (V, <). We order the components of τ according to the order of features in V as follows:

[τ′]τ < [τ″]τ ⇔ ∀f′ ∈ δ([τ′]), ∀f″ ∈ δ([τ″]) : f′ < f″

It is noticeable that |δ([τ′])| = |δ([τ″])| = 1 for all τ′, τ″ ∈ block(τ), since the first features of all elements in a component are equal.

We denote the i'th component of τ/∼1 as [τ]i. Due to the fact that the features are strictly ordered, the term components are also strictly ordered.
Definition 6.26 (Well Formed Term). The well formed function, denoted as W, is a Boolean function from FC↑ to {True, False}, which gets a unified non factorized FC-term; it returns True if the set of first features of its components is equal to the sort of A to which these features belong, and False otherwise. Formally:

W(τ) = True  if  δ(τ/∼1) = sort(δ([τi]τ))  ∀τi ∈ block(τ)
W(τ) = False  otherwise

where δ(τ/∼1) = sort(δ([τi]τ)) means that the set of the first features of the components of the term τ is equal to the attribute to which the first feature belongs. A unified term τ is called a well formed term if W(τ) = True. An atomic term is a well formed term.
Example 6.35. The unified term of Example 6.33, τ = r · s ⊨ S + r · c ⊨ S + b · s ⊨ S, is a well formed term, since we have:

δ(τ/∼1) = δ({{r · s ⊨ S, r · c ⊨ S}, {b · s ⊨ S}}) = {r, b}
sort(δ([r · s ⊨ S]τ)) = sort(δ({r · s ⊨ S, r · c ⊨ S})) = sort({r}) = {r, b}

and consequently W(τ) = True.
It is noticeable that in an ordered CCTree term all first features belong to the same attribute. Hence, in what follows we exploit the concept of a well formed term to identify whether a term represents a CCTree term or not.
6.5.1 CCTree Term Schema
We know that each CCTree term is a feature-cluster family term. Conversely, however, a feature-cluster family term does not necessarily represent a CCTree term. It is thus interesting to know which feature-cluster family terms represent CCTree terms: this knowledge provides us with the opportunity to iteratively apply the rules on CCTree terms.
Theorem 6.36. A unified term represents a CCTree term, i.e. it is transformable to a CCTree structure, if and only if it can be written in the following form:

F*(τ) = Σi fi · τi    (6.45)

such that W(F*(τ)) = True, i.e. the unified form of the received term is a well formed term, and the unified form of each τi is a well formed term as well (W(τi) = True) which respects the above formula.
Proof. First we show that a unified term obtained from a CCTree structure satisfies equation 6.45. In a CCTree, the attribute used for the division at the root has the greatest number of occurrences in the non factorized CCTree term (all blocks of the CCTree term contain one of the features of this attribute). According to 6.37, for transforming the tree to a term, the first features of the components are specified by δ(T) = {f | ∃ s′ ∈ Q s.t. (r(T), f, s′) ∈ ω}, where in a CCTree they all belong to the same sort, i.e. we have:

δ(T) = {f | ∃ s′ ∈ Q s.t. (r(T), f, s′) ∈ ω} = sort({f}) ⇒ W(ψ(T)) = True

We call the tree following a child of the root a new tree. It is noticeable that each new tree is a CCTree by itself; hence, it respects 6.45. Considering the trees following the new trees as new trees themselves, the aforementioned process is repeated iteratively for all new trees, due to the iterative structure of the CCTree, i.e. from 6.37 we have:

∀ f ∈ δ(T) : W(ψ(∂f(T))) = True
This means that if the input tree structure is a CCTree, then the obtained term respects the above formula.

On the other hand, a unified term that respects equation 6.45 can be converted to a CCTree structure. To this end, the τi are the components of τ after separating their first features (the fi). The set of the first features of the components of the term constitutes the transitions of the first division from the root of the CCTree, i.e.:

Ω(Σ_{[τi] ∈ τ/∼1} δ([τi]) · Σ_{τk ∈ [τi]} ∂(τk)) = ∪_{[τi] ∈ τ/∼1} {(S, δ([τi]), Σ_{τk ∈ [τi]} ∂(τk))}

where S is the main dataset the term originated from.

Since the term is well formed, the labels of the children are guaranteed to belong to the same sort, as required by a CCTree. Due to the iterative rule for the successive components, the structure of the CCTree is constructed iteratively. Note that the condition that the first features of the components equal a sort guarantees that, in the process of transforming the term to its equivalent tree structure, all the features of a selected attribute exist.
With the use of the above theorem, we propose a rewriting system which automatically checks whether a term represents a CCTree term or not.
CCTree Rewriting System
To automatically verify whether a term is a CCTree term, a set of conditional rewriting rules is provided in Table 6.1. The term ∅ in this table refers to the null term. The CCTree rewriting system is applied to a received term; the term is a CCTree term if the only irreducible term is ∅.

In this rewriting system, ⟦f(τ)⟧ means that the semantics of f(τ) is substituted, whilst the result is considered as one unique term, not several terms. Furthermore, τ1 : τ2 contains two terms τ1 and τ2, where each one is considered as a new term. Moreover, [τ]i refers to the i'th component of τ/∼1.
(1) (τ ∈ A) | τ → ∅
(2) (τ ≠ F*(τ)) | τ → ⟦F*(τ)⟧
(3) (τ = F*(τ)) ∧ (W(τ)) ∧ (τ ∉ A) | τ → ⟦Σ_{τk ∈ [τ]1} ∂(τk)⟧ : ... : ⟦Σ_{τk ∈ [τ]_{|τ/∼1|}} ∂(τk)⟧

Table 6.1 – CCTree Rewriting System
The first rule of Table 6.1 specifies that an atomic term is directed to ∅. The second rule expresses that if a term is not in unified form, it is transferred to its unified representation. The third rule specifies that if a non atomic unified term is well formed, it is divided into the derivatives of its components; the same rules are then applied to verify whether the CCTree conditions are satisfied for the resulting components. These rules follow the structure of Theorem 6.36 in identifying whether a term is a CCTree term or not.
Example 6.37. Suppose that the term τ1 = a1 ⊨ S + b1 ⊨ S, with the set of attributes A = {a1, a2}, B = {b1, b2}, is given. We apply the CCTree rewriting rules to automatically verify whether τ1 is a CCTree term.

The term τ1 is not atomic. Moreover, we have τ1 = F*(τ1) and W(τ1) = False. No CCTree rewriting rule can be applied, whilst this term is not ∅. This means that the received term τ1 is not a CCTree term.
Example 6.38. With the use of the CCTree rewriting system, we show that the term τ2 = a1 ⊨ S + a2 ⊨ S, with the set of attributes A = {a1, a2}, B = {b1, b2}, is a CCTree term:

(τ2 = F*(τ2)) ∧ (W(τ2)) | a1 ⊨ S + a2 ⊨ S →(3) S : S →(1) ∅ : ∅

There is no irreducible term except ∅; hence, τ2 is a CCTree term.
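In the spirit of Table 6.1 and Theorem 6.36, the small Python sketch below (our simplification: datasets are omitted and a term is a list of feature paths) checks whether a term is a CCTree term by testing well-formedness level by level, reproducing Examples 6.37 and 6.38.

    # A sketch of the CCTree term check (encoding and helper are ours): the set
    # of first features of the components must be exactly one attribute, and
    # the same condition must hold recursively on each derivative.
    ATTRIBUTES = [{"a1", "a2"}, {"b1", "b2"}]   # hypothetical attributes A and B

    def is_cctree_term(blocks):
        """blocks: list of feature paths; [] is an atomic block (rule 1)."""
        if all(not path for path in blocks):
            return True                                    # atomic term -> null term
        firsts = {path[0] for path in blocks if path}
        if not any(firsts == attr for attr in ATTRIBUTES):
            return False                                   # W(tau) = False: not well formed
        return all(                                        # rule (3): check each derivative
            is_cctree_term([path[1:] for path in blocks if path and path[0] == f])
            for f in firsts)

    print(is_cctree_term([["a1"], ["b1"]]))   # False, as in Example 6.37
    print(is_cctree_term([["a1"], ["a2"]]))   # True,  as in Example 6.38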
6.5.2 Termination and Confluence of the Rewriting System
In the present section, we first recall what termination and confluence of a rewriting system mean. Then, through several theorems, we prove that our proposed rewriting system is terminating and confluent.

Termination and confluence are desirable properties of a rewriting system: they guarantee, respectively, that applying the rewriting rules of the proposed system never falls into an infinite loop, and that applying the rewriting rules always yields a unique result.
Termination and Confluence of a Rewriting System
A rewriting system R is terminating if there are no infinite derivations a1 → a2 → a3 → ... in R. This implies that every derivation eventually ends in a normal form [43]. Lankford's theorem states that a rewriting system R is terminating if, for some reduction ordering >, we have x > y for all rules x → y ∈ R. An order is a reduction ordering if it is monotonic and fully invariant [43]. A relation is monotonic if it preserves the order when a term is added to or removed from both sides, and it is fully invariant if it preserves the order when a term is substituted on both sides of the relation [43].

An element a in the rewriting system R is locally confluent if for all b, c ∈ R such that a → b and a → c, there exists d ∈ R such that b →* d and c →* d. If every a ∈ R is locally confluent, then → is called locally confluent. Newman's lemma expresses that a terminating rewriting system is confluent if and only if it is locally confluent [43].
Theorem 6.39. The CCTree rewriting system is terminating.

Proof. To prove this theorem, we first define a reduction order on the rules of the CCTree rewriting system. To this end, we define the size function, which gets an FC-term and returns the number of features appearing in the term, as follows:

size : FC → N
size(S) = 1
size(f ⊨ S) = 1
size(F · τ) = |F| + size(τ)
size(τ1 + τ2) = size(τ1) + size(τ2)

where we consider size(∅) = 0 and size(τ1 : τ2) = size(τ1) + size(τ2).
We say FC-term τ1 is less than FC-term τ2, denoted by τ1 ≤ τ2, if the number of features in τ1 is less than the number of features in τ2, or equivalently size(τ1) ≤ size(τ2). This partial ordering is well-founded, since there is no infinite descending chain (the number of features is finite). It is monotonic, because the relation between the numbers of features in two terms is preserved when a term is added to or removed from both sides. Furthermore, substitution on the left and right sides preserves the order of the number of features, i.e. it is fully invariant. Therefore, the proposed ordering is a reduction ordering.

Considering that ∅ is a null term containing no feature, in the first rule we have atomic term > ∅. The second rule is applied only when the term is not equal to its unified form, whilst the ordered unification function, if applied, does not change the number of features, i.e.

τ ≥ F*(τ)  for  τ ≠ F*(τ)

since size(τ) = size(F*(τ)). It is worth noticing that this rule is a one-step rule, such that once the term is unified, the other rules are exploited.

In the third rule, the first features of all components of the left-hand term are removed, i.e. the size (number of features) of the left-hand term is greater than the size (number of features) of the right-hand one. Hence, the proposed reduction ordering ≤ on the CCTree rewriting system shows that the system is terminating.
Theorem 6.40. The CCTree rewriting system is locally confluent.

Proof. In the CCTree rewriting system, all rules are conditional and there is no term for which two (or more) conditions are satisfied at the same time. This means that the possibility of having τ → τ1 and τ → τ2 with τ1 ≠ τ2 does not arise. Hence, the rewriting system is locally confluent.
Theorem 6.41. The CCTree rewriting system is confluent.

Proof. According to Newman's lemma, the CCTree rewriting system, being terminating (Theorem 6.39) and locally confluent (Theorem 6.40), is confluent.
6.6 CCTrees Parallelism
It is not uncommon for a data mining process to require several days or weeks to be completed. Parallel computing systems bring significant benefits, namely high performance, in processing massive databases [33]. Parallel clustering is a methodology proposed to alleviate the problems of time and memory usage in clustering large amounts of data [94], [18].

SPMD (Single Program Multiple Data) parallelism is the most common approach in parallel computation [135]. In an SPMD parallel algorithm, multiple computers run the same algorithm on different subsets of the data and exchange the partial results to merge them into a final result.

In the present work, we propose SPMD parallelism of CCTrees in terms of a rewriting system. To this end, a large amount of data desired to be clustered is divided among two (or more) parallel computers, where each computer clusters the received dataset with the use of the CCTree algorithm. The result of each CCTree is transformed to its equivalent CCTree term. The resulting CCTree terms are reported to a master computer for composition. The CCTree terms are composed automatically based on our proposed composition rewriting rules (Table 6.2). The composition result is reported back to each computer to homogenize all the CCTree terms, and consequently the structure of all CCTrees (Figure 6.2).

Figure 6.2 – Parallel Clustering Workflow.
Obtaining a CCTree term from the composition of the received terms provides us with two advantages: first, the process of parallelism can be continued iteratively; furthermore, it explains how the sets of clusters resulting from two (or more) CCTrees can be merged.

To address the composition process, a set of composition rewriting rules (Table 6.2) is proposed to automatically obtain a CCTree term when a term is not a CCTree term.
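The workflow of Figure 6.2 can be summarized by the following Python skeleton (all names are ours; the two placeholder bodies stand for the CCTree algorithm and for the composition rules of Table 6.2, which are not reproduced here).

    # A high-level SPMD skeleton of Figure 6.2, a sketch rather than the thesis'
    # implementation.
    from multiprocessing import Pool

    def build_cctree_term(chunk):
        """Worker: cluster `chunk` with CCTree and return the equivalent CCTree term."""
        ...  # placeholder: build the CCTree, then transform it to a term (Section 6.4.2)

    def compose(terms):
        """Master: compose the received terms with the rules of Table 6.2."""
        ...  # placeholder: add the terms, then rewrite (split/unify/derive) to a CCTree term

    def parallel_cctree(dataset, t):
        chunks = [dataset[i::t] for i in range(t)]        # divide the data among t devices
        with Pool(t) as pool:
            terms = pool.map(build_cctree_term, chunks)   # same program, different data
        final_term = compose(terms)                       # composition on the master
        return final_term                                 # broadcast back for homogenization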
The split relation, the 4th rule of Table 6.2, is added to the rules of Table 6.1 to obtain a CCTree term from a non CCTree term.
Definition 6.27 (Split). Let a unified term τ ∈ FC↑ on (V, <) and the set of attributes A be given. Considering u_{A,τ} as the upper bound attribute of τ, we define the split relation as follows:

split(τ) = τ  if  W(τ) = True
split(τ) = Σ_{τi ∈ block(τ)} ζ(τi)  if  W(τ) = False

where:

ζ(τi) = τi  if  D_A(u_{A,τ}, τi) = True
ζ(τi) = (Σ_{ai ∈ u_{A,τ}} ai) · τi  if  D_A(u_{A,τ}, τi) = False

This means that all blocks of τ which do not contain any feature of u_{A,τ} are multiplied by the addition of the features of u_{A,τ}.

In the following examples, we show how the split relation is applied.
Example 6.42. Let τ1 = r · s ⊨ S + r · c ⊨ S + b · s ⊨ S be given. We have W(τ1) = True, i.e. τ1 is a well formed term, which results in:

split(r · s ⊨ S + r · c ⊨ S + b · s ⊨ S) = r · s ⊨ S + r · c ⊨ S + b · s ⊨ S
Example 6.43. Suppose the term τ2 = r · s ⊨ S + c ⊨ S + b ⊨ S is given. We have:

W(r · s ⊨ S + c ⊨ S + b ⊨ S) = False

hence, τ2 is not a well formed term. Considering u_{A,τ2} = Color, we have:

D_A(Color, r · s ⊨ S) = True
D_A(Color, c ⊨ S) = False
D_A(Color, b ⊨ S) = True

which results in:

split(r · s ⊨ S + c ⊨ S + b ⊨ S) = r · s ⊨ S + (r + b) · c ⊨ S + b ⊨ S
= r · s ⊨ S + r · c ⊨ S + b · c ⊨ S + b ⊨ S
It is worth noticing that when a term is not a CCTree term, this can be inferred from its unified form when the first features of its components do not belong to the same attribute. The split rule is therefore proposed to create a well formed term from a non CCTree term. In what follows, we add the split rule to the previous rewriting system; it is used to obtain a CCTree term when a term is not a CCTree term.
6.6.1 Composition Rules
The composition rewriting rules to obtain a CCTree term from a non CCTree term are presented in Table 6.2. In the proposed rewriting system, ⟦f(τ)⟧ means that the semantics of f(τ) is substituted, whilst the result is considered as one unique term, not several terms. Furthermore, τ1 : τ2 contains two terms τ1 and τ2, where each one is considered as a new term. Moreover, [τ]i refers to the i'th component of τ/∼1.
(1) (τ ∈ A) | τ → ∅
(2) (τ ≠ F*(τ)) | τ → ⟦F*(τ)⟧
(3) (τ = F*(τ)) ∧ (W(τ)) ∧ (τ ∉ A) | τ → ⟦Σ_{τk ∈ [τ]1} ∂(τk)⟧ : ... : ⟦Σ_{τk ∈ [τ]_{|τ/∼1|}} ∂(τk)⟧
(4) (τ = F*(τ)) ∧ (¬W(τ)) | τ → ⟦split(τ)⟧

Table 6.2 – Composition Rewriting System
Compared to Table 6.1, only the fourth rule (the split rule) is added. This rule guarantees that, when a term is not a CCTree term, splitting the term based on the upper bound attribute may yield a CCTree term.
6.6.2 CCTree Term from Composition Rewriting Rules
Here we briefly explain how to find a CCTree term from a non CCTree term with the use of the composition rewriting system. First of all, the set of attributes A describing the received term τ is provided; note that in a categorical clustering algorithm the set of attributes is known beforehand. The set of attributes and the non CCTree term are given to the composition rewriting system. Whenever the conditions of rule (3), i.e. (τ = F*(τ)) ∧ (W(τ)) | τ → ⟦Σ_{τk ∈ [τ]1} ∂(τk)⟧ : ... : ⟦Σ_{τk ∈ [τ]_{|τ/∼1|}} ∂(τk)⟧, are satisfied for a term τ, we save τ. Then each ⟦Σ_{τk ∈ [τ]i} ∂(τk)⟧ of τ is replaced by its own successive terms respecting this rule. This process is repeated iteratively until an atomic term is reached in all components of the term. The result is the desired CCTree term.
Example 6.44. Suppose that the addition of two CCTree terms is given as τ = a1 ⊨ S + a2 ⊨ S + b1 ⊨ S′ + b2 ⊨ S′, with the set of attributes A = {a1, a2}, B = {b1, b2}. It is easy to verify from the rules of Table 6.1 that τ is not a CCTree term. We are interested in finding a CCTree term from the received non CCTree term τ with the use of the composition rewriting system. To this end we have:
(i) (τ = F*(τ)) ∧ (¬W(τ)) | τ →(4) ⟦split(τ)⟧
(ii) ⟦split(τ)⟧ = τ′ = a1 ⊨ S + a2 ⊨ S + (a1 + a2) · b1 ⊨ S′ + (a1 + a2) · b2 ⊨ S′
(iii) (τ′ ≠ F*(τ′)) | τ′ →(2) ⟦F*(τ′)⟧ = a1 · (S + b1 ⊨ S′ + b2 ⊨ S′) + a2 · (S + b1 ⊨ S′ + b2 ⊨ S′) = τ″
(iv) (τ″ = F*(τ″)) ∧ (W(τ″)) | τ″ →*(3)* S + b1 ⊨ S′ + b2 ⊨ S′ (I) : S + b1 ⊨ S′ + b2 ⊨ S′ (II)

(I) S + b1 ⊨ S′ + b2 ⊨ S′ →(4) (b1 + b2) · S + b1 ⊨ S′ + b2 ⊨ S′ →(2) b1 · (S + S′) + b2 · (S + S′) →*(3)* S + S′ : S + S′ →(1) ∅ : ∅

(II) S + b1 ⊨ S′ + b2 ⊨ S′ →(4) (b1 + b2) · S + b1 ⊨ S′ + b2 ⊨ S′ →(2) b1 · (S + S′) + b2 · (S + S′) →*(3)* S + S′ : S + S′ →(1) ∅ : ∅

To find the resulting CCTree term, we consider the terms respecting rule (3), shown with *(3)*. Hence, we have them as follows:

(*) a1 · (S + b1 ⊨ S′ + b2 ⊨ S′) + a2 · (S + b1 ⊨ S′ + b2 ⊨ S′)
(**) b1 · (S + S′) + b2 · (S + S′)
(***) b1 · (S + S′) + b2 · (S + S′)

Then, since (**) results from the first occurrence of the term S + b1 ⊨ S′ + b2 ⊨ S′ inside (*), and (***) from its second occurrence inside (*), we replace them in their previous form:

a1 · (b1 · (S + S′) + b2 · (S + S′)) + a2 · (b1 · (S + S′) + b2 · (S + S′))

Since there is no further term respecting rule (3), the above term is the desired CCTree term. It is easy to automatically verify, according to Table 6.1, that the resulting term is a CCTree term.
6.6.3 CCTree Homogenization
After the final CCTree term, resulting from the composition of two (or more) CCTree terms, is returned to the parallel devices, the CCTree term of each computer has to be extended to the final CCTree term. The extension of each CCTree term to the final CCTree term homogenizes the structure of all the CCTrees. To this end, it is enough to add each CCTree term to the final CCTree term. Then, all the split rules applied on a CCTree term in the process of its composition with the final CCTree term show the required splits in the associated CCTree structure, following the procedure of transforming a term to a tree provided in Section 6.4.2.
Note: It is worth noticing that after homogenizing all the CCTrees to the final CCTree, the data respecting the same set of features go to the same cluster of the final CCTree. However, merging many data points from different clusters of different CCTrees into one cluster may cause the final nodes not to respect the required purity. To solve this issue, after merging the data, the purity of each final node should be computed and, if it is not pure enough, the node should be split based on the CCTree construction rules.
Theorem 6.45. The composition rewriting system is terminating.

Proof. The only rule added to the composition rewriting system, compared to the CCTree rewriting system, is the split rule. We show that the split rule does not contradict the termination of the rewriting system. First of all, the split rule is a one-step rule, i.e. the result of the split rule, after one application, is considered as the premise of other rules (which decrease the term). On the other hand, on each term the split rule is applied at most as many times as the number of attributes, which is finite. Hence, since split is a one-step rule and for each term it is called finitely many times, the composition rewriting system is terminating.
Theorem 6.46. The composition rewriting system is locally confluent.

Proof. There is no term respecting two (or more) conditions of the composition rewriting system at the same time, i.e. there is no term τ for which τ → τ1 and τ → τ2 with τ1 ≠ τ2. This means that the composition rewriting system is locally confluent.

Theorem 6.47. The composition rewriting system is confluent.

Proof. From Theorems 6.45 and 6.46, the composition rewriting system is terminating and locally confluent, respectively. Hence, from Newman's lemma, the composition rewriting system is confluent.
6.6.4 Time Complexity
Here we present a theorem which gives the time complexity of constructing several CCTrees on parallel devices.

Theorem 6.48. Let n be the total number of elements desired to be clustered, r the number of attributes, m the total number of features, vmax the maximum number of values of an attribute, and K the maximum number of non leaf nodes. The time complexity of constructing CCTrees on t parallel devices equals:

(1/t) · O(K × (n × m + n × vmax))
Proof. In Section 3.5, we explained how to calculate the time complexity of constructing a CCTree. Recall that n is the number of elements in the whole dataset, ni the number of elements in node i, m the total number of features, vl the number of features of attribute Al, r the number of attributes, and vmax = max{vl} (1 ≤ l ≤ r).

For constructing a CCTree, if K = m + 1 is the maximum number of non leaf nodes, which arises in a complete tree, then the maximum time required for constructing a CCTree with n elements equals O(K × (n × m + n × vmax)).

Now, if we equally divide the dataset containing n points among t devices, it takes O(K × ((n/t) × m + (n/t) × vmax)) = (1/t) · O(K × (n × m + n × vmax)) to create the t CCTrees, i.e. the whole required time is divided by the number of devices. The remaining part, the algebraic calculations, requires constant time.
6.7 Conclusion
In this chapter, a semiring-based formal method, named Feature-Cluster Algebra, has been proposed to abstract the representation of the categorical clustering algorithm CCTree.

Abstraction theory is an elegant mathematical concept, which constructs a brief sketch of the original representation of a problem in order to deal with it more easily. More precisely, abstraction is the process of mapping a representation of a problem, called the ground (semantic) representation, onto a new representation, called the abstract (syntactic) representation, in such a way that it is possible to deal with the problem in the original space while preserving certain desirable properties, and in a simpler way, since the abstract representation is constructed from the ground one by removing unwanted detail. The abstraction process is performed with the use of a powerful algebraic structure named a semiring. Through several theorems and examples, we showed that the proposed approach, under some conditions, fully abstracts the CCTree structure. The full abstraction property guarantees that the semantic and syntactic forms of a problem can be used interchangeably, whilst preserving the required properties.
Furthermore, we presented a set of functions and relations on the feature-cluster algebra, which are used to present the CCTree schema in general, and we provided a rewriting system which automatically identifies whether a term represents a CCTree or not.
The CCTree abstract representation is used in CCTree parallel clustering. Generally, the process of clustering requires time and space, specially when a large amount of data are desired
to be analyzed. The problem of time and precision in clustering becomes more challenging in
security issues, where the fast and precise analysis is required to find the strategies against
intruder.
We proposed a rewriting system which automatically returns a CCTree term, in a way that
all CCTrees in parallel devices can be generalized to.
The termination and confluence of the proposed rewriting system have been proved, which guarantees, first, that applying the proposed rewriting rules never loops, and moreover, that the resulting final term is unique.
To the best of our knowledge, the technique proposed in this chapter is a novel methodology in applying an algebraic structure to formalize the representation of a clustering algorithm and to address the associated issues. The proposed approach can be extended to other feature-based clustering and classification algorithms.
Chapter 7
Conclusions and Future Work
In this final chapter, we first summarize what we presented in this work; afterwards, we present future directions for continuing the present study.
7.1 Thesis Summary
The current strategies to minimize the impact of spam messages mostly focus on stopping spam messages from being delivered to the end user's inbox. This kind of defense, although quite effective in decreasing the cost of spam emails, does not stop spammers, who still impose a non-negligible cost on users and companies. The reason could be that the spammer, the root of the problem, runs a minimal risk of being pursued, whilst he has the possibility to send millions of messages in a short period of time at minimum expense. To this end, analyzing a spammer's behavior to find strategies against him, and possibly to prosecute him, becomes an important issue in spam forensics. However, such an effort requires a first analysis of huge amounts of spam messages, collected in honey-pots in a short period of time, whilst the size of the collection is magnified after a few minutes.
To address this issue, in this thesis we first proposed a categorical clustering algorithm, named CCTree, to group large amounts of spam messages into smaller groups based on structural similarity. CCTree has a tree-like structure, where the root node of the tree contains all spam messages. The CCTree divides spam messages step by step, grouping similar data together and obtaining homogeneous subsets of data points. The measure of similarity of clustered data points at each step of the algorithm is given by an index called node purity. If the level of purity is not sufficient, it means the spam messages belonging to this node are not sufficiently homogeneous, and they should be divided into different subsets (nodes) based on the characteristic (attribute) that yields the highest value of entropy. The rationale behind this choice is that dividing data on the basis of the attribute which yields the greatest entropy helps in creating more homogeneous subsets, where the overall value of entropy is consistently reduced. This approach aims at reducing the time needed to obtain homogeneous subsets. The division process of non-homogeneous sets of data points is repeated iteratively till all sets are sufficiently pure or the number of elements belonging to a node is less than a specific threshold identified by the user. These pure sets are the leaves of the tree and represent the desired spam campaigns.
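As a toy illustration of these two notions, consider a node of four emails described by a single categorical attribute; using the shannon_entropy function of Appendix A.1 (the values are made up for the example):

% One attribute over four emails, taking the values {1, 1, 2, 3}.
vals = [1 1 2 3];
e = shannon_entropy(vals);
% e = -(2/4)*log2(2/4) - 2*(1/4)*log2(1/4) = 1.5 bits, far from pure;
% a homogeneous node such as [2 2 2 2] yields entropy 0. Node purity is
% the weighted average of such entropies over all attributes, and the
% attribute of maximum entropy is the one chosen for splitting.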
To apply CCTree to clustering large amounts of spam emails into spam campaigns, we provided a set of 21 categorical features representative of email structure. Then, through an analysis of 200k spam emails, we proposed and validated a methodology to choose the optimal CCTree parameters based on the detection of the maximum curvature point (knee) on a homogeneity versus number-of-clusters graph. We proved the effectiveness of CCTree in spam campaign detection through internal evaluation, to estimate its ability to obtain homogeneous clusters, and external evaluation, to assess its ability to effectively classify similar elements (emails) when classes are known beforehand. The efficiency of CCTree has been shown through a comparison with one of the fastest well-known categorical clustering algorithms.
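The knee detection itself can be sketched in a few lines; the snippet below is an illustrative variant (not necessarily the exact procedure used in the experiments) that locates the strongest bend of the homogeneity curve through the discrete second difference:

% h(i): homogeneity obtained with the i-th candidate parameter setting
% (illustrative values); the number of clusters grows with i.
h = [0.41 0.62 0.78 0.86 0.89 0.90 0.91];
d2 = diff(h, 2);          % discrete second difference of the curve
[~, knee] = min(d2);      % most negative bend = maximum downward curvature
knee = knee + 1;          % d2(i) is centered on h(i+1)
% The parameter setting producing h(knee) is retained for CCTree.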
We proposed a framework, named Digital Waste Sorter (DWS), which exploits a self-learning, spammer-goal-based approach for spam email classification. The proposed approach aims at automatically classifying large amounts of raw unclassified spam emails, dividing them into campaigns and labeling each campaign with its spammer goal. To this end, we proposed five class labels to group spammer goals into five macro-groups, namely Advertisement, Portal Redirection, Advanced Fee Fraud, Malware Distribution, and Phishing. Moreover, a set of 21 categorical features representative of email structure is proposed to perform a multi-feature analysis aimed at identifying emails related to a large range of cybercrimes. DWS is based on the cooperation of unsupervised and supervised learning algorithms, given a set of classes describing different spammer goals and a dataset of unclassified spam emails. First, the proposed approach automatically creates a valid training set for a classifier by exploiting CCTree, which is effective in dividing spam emails into homogeneous clusters. Afterwards, significant spam campaigns useful for the generation of the training set are selected through similarity with a small set of known emails representative of each spam class. Hence, a classifier is trained using the selected campaigns as a training set, and is then used to classify the remaining unclassified emails of the dataset. Furthermore, we propose six features, including the label of campaigns discovered with DWS, to automatically rank a set of spam campaigns according to investigator priorities.
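In code, the DWS pipeline can be summarized as follows. This is a hedged sketch: nearest_seed_class and sim_threshold are hypothetical names for the campaign-to-seed similarity step, the thresholds are placeholders, and fitcnb (MATLAB's naive Bayes trainer) merely stands in for whichever classifier is chosen:

% DWS sketch: cluster, select campaigns similar to known seed emails,
% train a classifier on them, then classify the remaining emails.
[campaigns, ~] = CCTree(raw_emails, 0.5, 10);       % hypothetical thresholds
train_X = []; train_y = [];
for i = 1:numel(campaigns)
    [cls, score] = nearest_seed_class(campaigns{i}, seeds);  % hypothetical helper
    if score >= sim_threshold                       % keep only clear-cut campaigns
        train_X = [train_X; campaigns{i}];
        train_y = [train_y; repmat(cls, size(campaigns{i}, 1), 1)];
    end
end
model = fitcnb(train_X, train_y);                   % any supervised learner fits here
labels = predict(model, remaining_emails);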
Finally, to abstract the CCTree representation, we proposed a semiring-based approach, named feature-cluster algebra. Several relations and functions are defined on the abstract schema of a CCTree, named CCTree term. The concept of CCTree term is applied in the formalization of CCTree parallelism, which is expressed in terms of a rewriting system. Clustering parallelism can be used to speed up the process of grouping large amounts of data on parallel devices.
To summarize, what we proposed in this thesis can be used as a tool for cybercrime investigators to automatically organize a huge amount of spam messages in a short period of time. This tool provides the investigator with the priority of the most dangerous spammers, through the best-ranked spam campaigns, required to be followed.
7.2 Future Work
This thesis can be extended in several directions. In what follows, we present the extensions we plan to pursue.
The technique that we proposed in this thesis can be applied as a useful tool for the automatic, fast detection of the most dangerous spam campaigns. To show the efficiency and effectiveness of our proposed approach, we plan to apply it to a huge amount of spam messages containing one of the most dangerous current spam campaigns, e.g., the CryptoWall 3.0 malware, and to show that our approach detects it automatically among other campaigns.
To speed up the process of clustering spam emails into campaigns, we expect to apply several sampling algorithms. In statistics, sampling is concerned with the selection of a subset of elements for which the statistical properties of the dataset are preserved, and it is applied to estimate characteristics of the whole population. In the context of spam messages, since we always encounter a large amount of data, finding the best strategy for sampling data from the whole dataset, which preserves its main characteristics, may help to speed up the analysis.
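As a first instance of this idea, plain uniform sampling is already expressible in a few lines (randsample is from MATLAB's Statistics Toolbox; the 10% rate and threshold values are illustrative):

% Cluster a uniform 10% sample instead of the full dataset.
n = size(spam_data, 1);
idx = randsample(n, round(0.1 * n));    % sample without replacement
[campaigns, labels] = CCTree(spam_data(idx, :), 0.5, 10);  % hypothetical thresholds
% A structure-preserving strategy should additionally keep the
% feature-value distribution of Appendix A.2 intact, e.g., by
% stratifying on the attributes.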
Furthermore, we plan to apply the proposed methodology to detecting, labeling, and ranking social spam campaigns, e.g., on Facebook or Twitter. To this end, first of all, the representative features of social spam campaigns should be identified. Afterwards, the most popular cybercrimes in social networks should be characterized as the labels of discovered spam campaigns to train a classifier. Finally, the ranking features need to be identified to order the set of social spam campaigns.
Another area of research to which we are interested in applying our proposed methodology is botnet detection and finding the botmaster, the root of the problem. Although many efforts have been made to prosecute botmasters through botnets, we expect our proposed approach to work well in botnet detection through precise spam campaign detection, and consequently in catching the spammer. The reason is that we believe the proposed mechanism is able to precisely identify the zombies (bots) controlled by the same spammer (botmaster).
On the formalization side, there are many directions in which our proposed approach can be extended, since it is among the very first efforts in applying formal methods to clustering algorithms. First, we plan to extend the idea of semirings to abstract the representation of other well-known categorical clustering algorithms. Then, we will apply the abstract schema to concepts related to feature analysis, parallel clustering, etc. Furthermore, we plan to exploit further properties of semirings to address more issues in categorical clustering algorithms. For example, semiring homomorphisms can be applied to automatically identify whether or not two categorical clusterings are identical.
Publications
• Sheikhalishahi, M., Mejri, M., and Tawbi, N. (2015). Clustering spam emails into campaigns. In 1st International Conference on Information Systems Security and Privacy [126].
• Sheikhalishahi, M., Saracino, A., Mejri, M., Tawbi, N., and Martinelli, F. (2015). Fast and effective clustering of spam emails based on structural similarity. In 8th International Symposium on Foundations and Practice of Security [129].
• Sheikhalishahi, M., Saracino, A., Mejri, M., Tawbi, N., and Martinelli, F. (2015). Digital waste sorting: a goal-based, self-learning approach to label spam email campaigns. In 11th International Workshop on Security and Trust Management [128].
• Sheikhalishahi, M., Mejri, M., and Tawbi, N. (2016). On the abstraction of a categorical clustering algorithm. In Machine Learning and Data Mining in Pattern Recognition - 12th International Conference [127].
Table 7.1 – Table of Notations

µ             Node purity
ε             Minimum number of elements in a node
A             The set of sorts (attributes)
V_A           The carrier set of sort A
V             The union set of carrier sets of A
sort          A function which returns a set of carrier sets of received features
F             The power set of the power set of V
F1            A subset of F in which each set contains just one element
S             The set of records (elements)
⊨             Satisfaction relation
F_S           The set of elements of S that satisfy the set of features F
FC            Set of feature-cluster terms
A             Atomic terms
block         A function which returns a set of feature-cluster terms
≡             FC-term comparison
C             The set of terms
−→            Factorization rewriting rule
−→_d          Defactorization rewriting rule
FC↓           The set of factorized FC-terms
FC↑           The set of non factorized FC-terms
(Σ, Q, δ)     Graph structure
[[.]]         A function which returns a tree from a received feature-cluster family term
Ψ             A function which returns a feature-cluster family term from a received forest (tree)
G_{V,FC}      The set of all possible forests on the set of edge labels V and node labels FC
≈             Ordered FC-terms comparison
D_A           Attribute division function
δ             Initial function
B ≺_τ A       Attribute B is smaller than attribute A
‖             Ordered unification function
∂             Derivative function
[τ]_i         The i'th component of τ
W(τ)          Well formed term
F∗(A, τ)      Unified term
split(τ)      Split function
Appendix A
A.1 Source Codes of Proposed Approach
In what follows, some important source codes used in CCTree construction, labeling, and evaluation are provided.
Shannon entropy function:

% INPUT:
%   attribute_vals: [1*N] INTEGER
%   The vector with the values of one attribute inside a cluster.
% OUTPUT:
%   entropy: [1*1] DOUBLE
%   The entropy for the specific attribute.
function entropy = shannon_entropy(attribute_vals)
    ordered_vect = sort(attribute_vals);
    % Order the array to separate the different values of the attribute.
    vector_size = size(attribute_vals);
    i = 0;
    while isempty(ordered_vect) == 0
        % Count the number of elements for each attribute value in the vector.
        i = i + 1;
        index = find(ordered_vect == ordered_vect(1));
        temp = size(index);
        dim(i) = temp(2);
        ordered_vect(index) = [];
    end
    entropy = 0;
    counter = size(dim);
    for j = 1:counter(2)
        % Compute the entropy.
        entropy = entropy - ((dim(j)/vector_size(2)) * log2(dim(j)/vector_size(2)));
    end
end
The Shannon entropy of a cluster:

function e = clustering_entropy(A, ci)
    num_clusters = size(A);
    num_clusters = num_clusters(2);
    result = 0;
    for i = 1:num_clusters
        if not(isempty(A{i}))
            num_cols_ai = size(A{i});
            num_cols_ai = num_cols_ai(2);
            vect_ai = A{i}(:, num_cols_ai)';    % element identifiers of cluster i
            num_cols_ci = size(ci);
            num_els_ci = num_cols_ci(1);
            num_cols_ci = num_cols_ci(2);
            vect_ci = ci(:, num_cols_ci - 1)';  % element identifiers of the class
            intersection = intersect(vect_ai, vect_ci);
            dim = size(intersection);
            dim = dim(2);
            if (dim ~= 0)
                result = result + (dim/num_els_ci) * log(dim/num_els_ci);
            end
        end
    end
    e = -result;
end
Node purity:

function [np, max_entropy_attribute] = node_purity(data, weight)
    n_attr = size(data);
    n_attr = n_attr(2) - 3;
    if nargin < 2
        weight = ones(1, n_attr) * 1/n_attr;  % uniform weights by default
    end
    np = 0;
    max_entropy = 0;
    max_entropy_attribute = 1;
    for i = 1:n_attr-1
        temp_entropy = shannon_entropy(data(:, i)');
        if temp_entropy > max_entropy
            max_entropy = temp_entropy;
            max_entropy_attribute = i;  % remember the attribute of highest entropy
        end
        np = np + weight(i) * temp_entropy;
    end
end
CCTree function:

function [clusters, labels] = CCTree(data, node_purity_threshold, max_num_elem)
tic
num_elem = size(data);
num_elem = num_elem(1);
associate_vector = 1:num_elem;
associate_vector = associate_vector';  % count the email lines
data = [data, associate_vector];
level = 0;  % initialize data structures
nodes_per_level = {};
nodes_next_level = {};
all_nodes = {};
leaves = {};
[current_node_purity, current_attribute] = node_purity(data);
% Compute the node purity of the whole dataset and check the number of elements.
num_elem_curr_node = size(data);
num_elem_curr_node = num_elem_curr_node(1);
if current_node_purity > node_purity_threshold ...
        && num_elem_curr_node > max_num_elem
    % Split if the set is NOT pure AND has too many elements.
    [nodes_per_level, labels] = CCTreeSplit(data, current_attribute);
    % nodes_per_level contains the various clusters.
    level = 1;
else
    clusters = data;
    labels = [];
    return;
end
while 1
    num_nodes_curr_level = size(nodes_per_level);
    num_nodes_curr_level = num_nodes_curr_level(2);
    new_level = 0;
    % Boolean to check if there is a new level.
    for i = 1:num_nodes_curr_level
        % For all nodes in this level:
        temp_node = nodes_per_level{i};  % extract a cluster
        num_elem_curr_node = size(temp_node);
        num_elem_curr_node = num_elem_curr_node(1);
        [current_node_purity, current_attribute] = node_purity(temp_node);
        % Compute purity.
        if current_node_purity > node_purity_threshold ...
                && num_elem_curr_node > max_num_elem
            % The set is NOT pure AND has too many elements: split it.
            [temp_cell_array, temp_label] = CCTreeSplit(temp_node, current_attribute);
            % Assign the new clusters to a temp variable.
            nodes_next_level = [nodes_next_level, temp_cell_array];
            % Add the nodes to a deeper level.
            new_level = 1;
        else
            % Otherwise the node is a leaf.
            leaves = [leaves; temp_node];
            % Add it to the leaf collection.
        end
    end
    clusters = leaves;
    % Assign the leaves to the results.
    all_nodes = [all_nodes nodes_per_level];
    nodes_per_level = nodes_next_level;
    % The next level becomes the current level.
    nodes_next_level = {};
    if new_level == 0
        % Stop if all nodes are leaves.
        break;
    end
    level = level + 1;
end
toc
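For reference, a minimal call sequence (the threshold values are illustrative placeholders, not the parameters validated in the experiments; the labeling function is given next):

% data: one row per email, columns encoded as in the tables of Appendix A.2.
[clusters, labels] = CCTree(data, 0.5, 10);
M = CreateCCTreeLabelledMatrix(clusters);  % last column = campaign index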
CCTree labeling function:

function M = CreateCCTreeLabelledMatrix(c)
    iter = size(c);
    iter = iter(1);
    M = [];
    for i = 1:iter
        numofelements = size(c{i});
        numofelements = numofelements(1);
        vect = i * ones(numofelements, 1);  % cluster index used as label
        tempmat = c{i};
        tempmat = [tempmat, vect];
        if (i == 1)
            M = tempmat;
        else
            M = [M; tempmat];
        end
    end
end
Precision of a cluster:

function p = precision_cluster(Ai, Cj)
    num_cols_ai = size(Ai);
    num_cols_ai = num_cols_ai(2);
    num_cols_cj = size(Cj);
    num_el_cj = num_cols_cj(1);
    num_cols_cj = num_cols_cj(2);
    vect_ai = Ai(:, num_cols_ai)';
    vect_cj = Cj(:, num_cols_cj - 1)';
    intersection = intersect(vect_ai, vect_cj);
    result = size(intersection);
    p = result(2) / num_el_cj;
end
Recall of a cluster:

function r = recall_cluster(Ai, Cj)
    num_cols_ai = size(Ai);
    num_el_ai = num_cols_ai(1);
    num_cols_ai = num_cols_ai(2);
    num_cols_cj = size(Cj);
    num_cols_cj = num_cols_cj(2);
    vect_ai = Ai(:, num_cols_ai)';
    vect_cj = Cj(:, num_cols_cj - 1)';
    intersection = intersect(vect_ai, vect_cj);
    result = size(intersection);
    r = result(2) / num_el_ai;
end
Find clusters by purity:

function [index, purity] = FindClusterByPurity(data, leaves)
    num_of_leaves = size(leaves);
    num_of_leaves = num_of_leaves(1);
    tot_el = size(cell2mat(leaves));
    tot_el = tot_el(1);
    min_purity = Inf;
    index = -1;
    nattr_leaf = size(leaves{1});
    nattr_leaf = nattr_leaf(2);
    nattr_data = size(data);
    nattr_data = nattr_data(2);
    size_diff = nattr_leaf - nattr_data;
    data = [data, zeros(1, size_diff)];
    % Pad with empty values to match the size of a leaf.
    for i = 1:num_of_leaves
        num_of_elements = size(leaves{i});
        num_of_elements = num_of_elements(1);
        if (num_of_elements > 1)
            % Do not consider nodes with a single element.
            purity_old = node_purity_mod(leaves{i});
            purity_new = node_purity_mod([leaves{i}; data]);
            % Add the data point and compute the new purity.
            difference = (purity_new - purity_old);
            difference = difference * (num_of_elements);
            % Do not consider a node whose purity is increased.
            if difference < min_purity
                min_purity = difference;
                index = i;
            end
        end
    end
    purity = min_purity;
end
F-measure:

function f = FMeasure_Clusters(Ai, c)
    results = 0;
    num_of_clusters = size(c);
    num_of_clusters = num_of_clusters(1);
    for i = 1:num_of_clusters
        op = 2 * precision_cluster(Ai, c{i}) * recall_cluster(Ai, c{i}) ...
            / (precision_cluster(Ai, c{i}) + recall_cluster(Ai, c{i}));
        results = max(results, op);
    end
    f = results;
end
A.2 Tables of Attributes
In what follows, the set of features of each attribute, and the range of each feature, as applied in the CCTree algorithm, are presented in tables. Each table represents one attribute; the first column of each table constitutes the set of features of that attribute, and the second column shows the number we assigned to each feature in the same row.

The two binary attributes Link with at (@) and Links with non-ASCII characters are not presented in tables. For these two attributes, if no link with (@), respectively no link with a non-ASCII character, is present in the body of the spam message, we attribute the number 0 to this message; otherwise, the attributed number equals 1.
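As an illustration of the encoding these tables define, the following hypothetical helper (not part of the thesis code) discretizes a raw link count into the attributed numbers of Table A.10:

% Attributed number for "Number of Links" (Table A.10):
% counts 0..9 map to 0..9, 10-100 links to 10, more than 100 to 11.
function v = links_attributed_number(num_links)
    if num_links <= 9
        v = num_links;
    elseif num_links <= 100
        v = 10;
    else
        v = 11;
    end
end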
Table A.1 – Language of spam message and subject

Language               Attributed Number
Unknown language       0
English language       1
Italian language       2
French language        3
German language        4
Spanish language       5
Chinese language       6
Arabic language        7
Persian language       8
Japanese language      9
Russian language       10
Croatian language      11
Portuguese language    12
Indian language        13
Table A.2 – Type of Attachment

Attachment Type    Attributed Number
None               0
PDF                1
EXEC               2
DOC                3
PIC                4
TXT                5
ZIP                6
Other              7
Table A.3 – Attachment Size

Attachment Size                 Attributed Number
Attachment Size 0 kb            0
Attachment Size 1-100 kb        1
Attachment Size 100-500 kb      2
Attachment Size 500-1000 kb     3
Attachment Size 1000-more kb    4
Table A.4 – Number of attachments

Attachment Number           Attributed Number
No attachment               0
1 attachment                1
2 attachments               2
3 attachments               3
4 attachments and more      4
Table A.5 – Average size of attachments

Average Attachment Size                     Attributed Number
average size of attachment 0                0
average size of attachment 1-100            1
average size of attachment 100-500          2
average size of attachment 500-1000         3
average size of attachment 1000 and more    4
Table A.6 – Type of Message

Message Type    Attributed Number
Plain Text      1
HTML based      2
Image based     3
Links Only      4
Others          5
Table A.7 – Length of Message

Message Size                      Attributed Number
Length Class 0-100 kb             0
Length Class 100-200 kb           1
Length Class 200-300 kb           2
Length Class 300-400 kb           3
Length Class 400-500 kb           4
Length Class 500-600 kb           5
Length Class 600-700 kb           6
Length Class 700-800 kb           7
Length Class 800-900 kb           8
Length Class 900-1000 kb          9
Length Class 1000-5000 kb         10
Length Class 5000-10000 kb        11
Length Class 10000-20000 kb       12
Length Class 20000-30000 kb       13
Length Class 30000-40000 kb       14
Length Class 40000-50000 kb       15
Length Class 50000-60000 kb       16
Length Class 60000-70000 kb       17
Length Class 70000-80000 kb       18
Length Class 80000-90000 kb       19
Length Class 90000-100000 kb      20
Length Class 100000-more kb       21
Table A.8 – IP-based links verification

IP based Verification     Attributed Number
No IP based links         0
Contain IP based links    1
Table A.9 – Mismatch links

Mismatch Links              Attributed Number
No Mismatch link            0
1 Mismatch link             1
2 Mismatch links            2
3 Mismatch links and more   3
Table A.10 – Number of links

Number of Links        Attributed Number
No link                0
1 link                 1
2 links                2
3 links                3
4 links                4
5 links                5
6 links                6
7 links                7
8 links                8
9 links                9
10-100 links           10
more than 100 links    11
Table A.11 – Number of Domains

Number of Domains               Attributed Number
No domain                       0
1 domain in links               1
2 domains in links              2
3 domains in links              3
4 domains in links              4
5 domains in links              5
6-10 domains in links           6
more than 10 domains in links   7
Table A.12 – Average number of dots in links

Average Number of Dots in Links   Attributed Number
0 dots per link                   0
1 dot per link                    1
2 dots per link                   2
3 dots per link                   3
more than 3 dots per link         4
Table A.13 – Hex characters in links

Number of Links with Hex                Attributed Number
No link with Hex character              0
1 link with Hex character               1
2 links with Hex character              2
3 links with Hex character              3
4 links with Hex character              4
5 links with Hex character              5
6-10 links with Hex character           6
more than 10 links with Hex character   7
Table A.14 – Words in Subject

Number of Words in Subject      Attributed Number
No word in subject              0
1-5 words in subject            1
6-10 words in subject           2
more than 10 words in subject   3
Table A.15 – Characters in subject

Number of Characters in Subject      Attributed Number
No character in subject              0
1-10 characters in subject           1
10-20 characters in subject          2
more than 20 characters in subject   3
Table A.16 – Non-ASCII characters in subject

Number of Non-ASCII Characters in Subject      Attributed Number
No non-ASCII character in subject              0
1 non-ASCII character in subject               1
2-5 non-ASCII characters in subject            2
6-10 non-ASCII characters in subject           3
more than 10 non-ASCII characters in subject   4
Table A.17 – Recipients of spam email

Number of Recipients    Attributed Number
No recipient            0
1 recipient             1
2 recipients and more   2
Table A.18 – Images in spam messages

Number of Images        Attributed Number
No image                0
1 image                 1
2 images                2
3 images                3
4 images                4
5 images                5
6 images                6
7 images                7
8 images                8
9 images                9
10-20 images            10
21-30 images            11
31-40 images            12
41-50 images            13
51-100 images           14
101-500 images          15
501-1000 images         16
more than 1000 images   17
Bibliography
[1] A. A. Benczur, K. Csalogany, T. Sarlos, and M. Uher. Spamrank – fully automatic link spam detection. In Adversarial Information Retrieval on the Web, 2005.
[2] M. K. Albertini and R. F. de Mello. Formalization of data stream clustering properties and analysis of algorithms. In The International Conference on Artificial Intelligence (ICAI). The Steering Committee of The World Congress in Computer Science, Computer Engineering and Applied Computing, 2011.
[3] A. Almomani, B. B. Gupta, S. Atawneh, A. Meulenberg, and E. Almomani. A survey of phishing email filtering techniques. IEEE Communications Surveys and Tutorials, 15(4):2070–2090, 2013.
[4] D.S. Anderson, C. Fleizach, S. Savage, and G.M. Voelker. Spamscatter: Characterizing internet scam hosting infrastructure. In Proceedings of the 16th USENIX Security Symposium, 2007.
[5] R. Anderson, C. Barton, R. Böhme, R. Clayton, M. J.G. van Eeten, M. Levi, T. Moore, and S. Savage. Measuring the cost of cybercrime. In Rainer Böhme, editor, The Economics of Information Security and Privacy, pages 265–300. 2013.
[6] P. Andritsos, P. Tsaparas, R. Miller, and K.C. Sevcik. Limbo: Scalable clustering of categorical data. In Advances in Database Technology - EDBT 2004, volume 2992 of Lecture Notes in Computer Science, pages 123–146. 2004.
[7] P. Andritsos, P. Tsaparas, R. Miller, and K.C. Sevcik. Limbo: Scalable clustering of categorical data. In Advances in Database Technology - EDBT 2004, volume 2992 of Lecture Notes in Computer Science, pages 123–146. Springer Berlin Heidelberg, 2004.
[8] I. Androutsopoulos, J. Koutsias, K. V. Chandrinos, and C. D. Spyropoulos. An experimental comparison of naive bayesian and keyword-based anti-spam filtering with personal e-mail messages. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR, pages 160–167, New York, NY, USA, 2000.
[9] I. Androutsopoulos, J. Koutsias, K. V. Chandrinos, G. Paliouras, and C. D. Spyropoulos. An evaluation of naive bayesian anti-spam filtering. In Proceedings of the Workshop on Machine Learning in the New Information Age, 11th European Conference on Machine Learning, ECML, pages 9–17, 2000.
[10] H.B. Aradhye, G.K. Myers, and J.A. Herson. Image analysis for efficient categorization
of image-based spam e-mail. In Document Analysis and Recognition, 2005. Proceedings.
Eighth International Conference on, pages 914–918 Vol. 2, Aug 2005.
[11] G. Atkinson and A.M. Nevill. Statistical methods for assessing measurement error (reliability) in variables relevant to sports medicine. Sports Medicine, 26(4) :217–238, 1998.
[12] S. Baase and A. V. Gelder. Computer Algorithms : Introduction to Design and Analysis.
Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 3rd edition, 1999.
[13] M. Bailey, E. Cooke, F. Jahanian, X. Yunjing, and M. Karir. A survey of botnet technology and defenses. In Proceedings Conference For Homeland Security, 2009. CATCH
’09. Cybersecurity Applications Technology, pages 299–304, 2009.
[14] C. Beleites, U. Neugebauer, T. Bocklitz, C. Krafft, and J. Popp. Sample size planning
for classification models. Analytica chimica acta, 760 :25–33, 2013.
[15] D. Benavides, S. Segura, and A. Ruiz-Cortés. Automated analysis of feature models 20
years later : A literature review. Inf. Syst., 35(6) :615–636, September 2010.
[16] A. A. Benczur, K. Csalogany, T. Sarlos, and M. Uher. Spamrank–fully automatic link
spam detection work in progress. Proceedings of the first international workshop on
adversarial information retrieval on the web, 2005.
[17] A. Bergholz, G. PaaB, F. Reichartz, S. Strobel, and S. Birlinghoven. Improved phishing
detection using model-based features. In Fifth Conference on Email and Anti-Spam,
CEAS, 2008.
[18] P. Berkhin. A survey of clustering data mining techniques. In Jacob Kogan, Charles
Nicholas, and Marc Teboulle, editors, Grouping Multidimensional Data, pages 25–71.
Springer Berlin Heidelberg, 2006.
[19] J.C. Bezdek and N.R. Pal. Cluster validation with generalized dunn’s indices. In Proceedings of the Second New Zealand International Two-Stream Conference on Artificial
Neural Networks and Expert Systems, pages 190–193, 1995.
[20] B. Biggio, G. Fumera, I. Pillai, and F. Roli. Image spam filtering using visual information,iciap. In Image Analysis and Processing, 14th International Conference on, pages
105–110, Sept 2007.
[21] E. Blanzieri and A. Bryl. A survey of learning-based techniques of email spam filtering.
Artificial Intelligence Review, 29(1) :63–92, 2008.
[22] E. Blanzieri and A. Bryl. A survey of learning-based techniques of email spam filtering.
Artif. Intell. Rev., 29(1) :63–92, March 2008.
[23] Janusz A. Brzozowski. Derivatives of regular expressions. Journal of the ACM, 11(4):481–494, October 1964.
[24] S. Buhne, K. Lauenroth, and K. Pohl. Modelling requirements variability across product
lines. In Proceedings of the 13th IEEE International Conference on Requirements Engineering, RE ’05, pages 41–52, Washington, DC, USA, 2005. IEEE Computer Society.
[25] J. Caballero, P. Poosankam, D. Song, and C. Kreibich. Dispatcher : Enabling active
botnet infiltration using automatic protocol reverse-engineering. In In CCS09 : of the
16th ACM conference on Computer and communications security, pages 621–634. ACM,
2009.
[26] P.H. Calais, E. V. P. Douglas, O. G. Dorgival, M. Wagner, H. Cristine, and S.J. Klaus.
A campaign-based characterization of spamming strategies. In the proceedings of 5th
Conference on e-mail and anti-spam (CEAS), 2008.
[27] P.H. Calais, D.E.V Pires, D.O. Guedes, W. Meira, C. Hoepers, and K. Steding-Jessen.
A campaign-based characterization of spamming strategies. In CEAS, 2008.
[28] P.H. Calais Guerra, D.E.V. Pires, M.T. C. Ribeiro, D. Guedes, W. Meira, C. Hoepers,
M. H.P.C Chaves, and K. Steding-Jessen. Spam miner : A platform for detecting and
characterizing spam campaigns. Information Systems Applications, 2009.
[29] J. Carpinter and R. Hunt. Tightening the net : A review of current and next generation
spam filtering tools. Computers and Security, 25(8) :566 – 578, 2006.
[30] X. Carreras, L. Marquez, and J.H. Salgado. Boosting trees for anti-spam email filtering. In Proceedings of RANLP-01, 4th International Conference on Recent Advances in
Natural Language Processing, Tzigov Chark, BG, pages 58–64, 2001.
[31] R. Caruana, N. Karampatziakis, and A. Yessenalina. An empirical evaluation of supervised learning in high dimensions. In Proceedings of the 25th International Conference
on Machine Learning, ICML ’08, pages 96–103, New York, NY, USA, 2008.
[32] R. Caruana and A. Niculescu-Mizil. An empirical comparison of supervised learning
algorithms. In Proceedings of the 23rd International Conference on Machine Learning,
ICML ’06, pages 161–168, NY, USA, 2006. ACM.
[33] C.L.P. Chen and C.Y. Zhang. Data-intensive applications, challenges, techniques and
technologies : A survey on big data. Information Sciences, 275 :314 – 347, 2014.
[34] T.C. Chen, T. Stepan, S. Dick, and J. Miller. An anti-phishing system employing diffused
information. ACM Trans. Inf. Syst. Secur., 16(4) :16 :1–16 :31, April 2014.
[35] C. Cho, J. Caballero, C. Grier, V. Paxson, and D. Song. Insights from the inside : a view
of botnet management from infiltration. In Proceedings of the 3rd USENIX conference on
Large-scale exploits and emergent threats : botnets, spyware, worms, and more, LEET’10,
pages 2–2, Berkeley, CA, USA, 2010. USENIX Association.
[36] C. Cisco. Cisco 2015 annual security report. In www.cisco.com, 2015.
[37] E. M. Clarke and J. M. Wing. Formal methods : State of the art and future directions.
ACM Comput. Surv., 28(4) :626–643, December 1996.
[38] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience,
New York, NY, USA, 1991.
[39] K. Czarnecki and U. W. Eisenecker. Generative Programming : Methods, Tools, and
Applications. ACM Press/Addison-Wesley Publishing Co., New York, NY, USA, 2000.
[40] L.F. Da Cruz Nassif and E.R. Hruschka. Document clustering for forensic analysis : An
approach for improving computer inspection. Information Forensics and Security, IEEE
Transactions on, 8(1) :46–54, Jan 2013.
[41] J. Dan, Q. Jianlin, C. Yanyun, and C. Li. Clustering method and its formalization. In
Information Technology and Artificial Intelligence Conference (ITAIC), 6th IEEE Joint
International, volume 1, pages 57–61, Aug 2011.
[42] J. Dean and S. Ghemawat. Mapreduce : Simplified data processing on large clusters.
Commun. ACM, 51(1) :107–113, January 2008.
[43] N. Dershowitz and J.P. Jouannaud. Handbook of theoretical computer science (vol. b).
chapter Rewrite Systems, pages 243–320. MIT Press, Cambridge, MA, USA, 1990.
[44] S. Dinh, T. Azeb, F. Fortin, D. Mouheb, and M. Debbabi. Spam campaign detection, analysis, and investigation. Digital Investigation, 12, Supplement 1:S12–S21, 2015.
[45] D.L. Donoho, A. Flesia, U. Shankar, V. Paxson, J. Coit, and S. Staniford. Multiscale
stepping-stone detection : Detecting pairs of jittered interactive streams by exploiting
maximum tolerable delay. In Recent Advances in Intrusion Detection, volume 2516 of
Lecture Notes in Computer Science, pages 17–35. 2002.
[46] H. Drucker, D. Wu, and V.N. Vapnik. Support vector machines for spam categorization.
IEEE Transactions on Neural Networks, 10(5) :1048 –1054, 1999.
[47] Z. Duan, Peng Chen, F. Sanchez, Yingfei Dong, M. Stephenson, and J.M. Barker. Detecting spam zombies by monitoring outgoing messages. IEEE Transactions on Dependable
and Secure Computing, 9(2) :198–210, March 2012.
[48] F. Fdez-Riverola, E. L. Iglesias, F. Díaz, J. R. Méndez, and J. M. Corchado. Applying
lazy learning algorithms to tackle concept drift in spam filtering. Expert Syst. Appl.,
33(1) :36–48, July 2007.
[49] Report. Federal Trade Commission. www.consumer.ftc.gov. In Federal Trade Commission Reprot, 2009.
[50] I. Fette, N. Sadeh, and A. Tomasic. Learning to detect phishing emails. In Proceedings
of the 16th ACM International Conference on World Wide Web, pages 649–656, 2007.
[51] D.H. Fisher. Knowledge acquisition via incremental conceptual clustering. Mach. Learn.,
2(2) :139–172, 1987.
[52] J. François, S. Wang, R. State, and T. Engel. Bottrack : Tracking botnets using netflow
and pagerank. In Jordi Domingo-Pascual, Pietro Manzoni, Sergio Palazzo, Ana Pont,
and Caterina Scoglio, editors, NETWORKING 2011, volume 6640 of Lecture Notes in
Computer Science, pages 1–14. Springer Berlin Heidelberg, 2011.
[53] W. N. Gansterer and D. Pölz. E-mail classification for phishing defense. In Proceedings
of the 31th European Conference on IR Research on Advances in Information Retrieval,
ECIR ’09, pages 449–460, Berlin, Heidelberg, 2009. Springer-Verlag.
[54] H. Gao, J. Hu, C. Wilson, Z. Li, Y. Chen, and B.Y. Zhao. Detecting and characterizing
social spam campaigns. In Proceedings of the 10th ACM annual conference on Internet
measurement, pages 35–47, 2010.
[55] H. Gao, J. Hu, C. Wilson, Z. Li, Y. Chen, and B.Y. Zhao. Detecting and characterizing
social spam campaigns. In Proceedings of the 10th ACM SIGCOMM Conference on
Internet Measurement, IMC ’10, pages 35–47, New York, NY, USA, 2010. ACM.
[56] S. Garcia, J. Luengo, J. A. Saez, V. Lopez, and F. Herrera. A survey of discretization
techniques : Taxonomy and empirical analysis in supervised learning. IEEE Trans. on
Knowl. and Data Eng., 25(4) :734–750, April 2013.
[57] Z. Ghahramani. Unsupervised learning. In Advanced Lectures on Machine Learning,
volume 3176 of Lecture Notes in Computer Science, pages 72–112. Springer Berlin Heidelberg, 2004.
[58] S. Gilpin, S. Nijssen, and I. Davidson. Formalizing hierarchical clustering as integer
linear programming. In Proceedings of the twenty-seventh AAAI conference on artificial
intelligence, 2013.
[59] F. Giunchiglia and T. Walsh. A theory of abstraction. Artif. Intell., 57(2-3) :323–389,
October 1992.
[60] M. A. Gluck and J. E. Corter. Information Uncertainty and the Utility of Categories.
In Proceedings of the Seventh Annual Conference of Cognitive Science Society, pages
283–287, 1985.
[61] R. Grinker, S. Lubkemann, and C.B. Steiner. In Perspectives on Africa : A readerin
Culture, History and Representation, pages 618–621, 2012.
[62] J. L. Gross and J. Yellen. Graph Theory and Its Applications, Second Edition (Discrete
Mathematics and Its Applications). Chapman & Hall/CRC, 2005.
[63] R. Hadjidj, M. Debbabi, H. Lounis, F. Iqbal, A. Szporer, and D. Benredjem. Towards an
integrated e-mail forensic analysis framework. Digital Investigation, 5(3–4) :124 – 137,
2009.
[64] M. Halkidi and M. Vazirgiannis. Clustering validity assessment : finding the optimal
partitioning of a data set. In Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference on, pages 187–194, 2001.
[65] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The
weka data mining software : An update. SIGKDD Explor. Newsl., 11(1) :10–18, 2009.
[66] J. Han, M. Kamber, and J. Pei. Data Mining : Concepts and Techniques. Morgan
Kaufmann Publishers Inc., San Francisco, CA, USA, 3rd edition, 2011.
[67] J. Han, J. Pei, Y. Yin, and R. Mao. Mining frequent patterns without candidate generation : A frequent-pattern tree approach. Data mining and knowledge discovery,
8(1) :53–87, 2004.
[68] U. Hebisch and H.J. Weinert. Semiring- Algebraic Theory and Application in Computer
Science. World Scientific, 1998.
[69] J. Hedley. Jsoup cookbook. http://jsoup.org/cookbook, 2009.
[70] P. Hell and J. Nesetril. Graphs and homomorphisms. Oxford lecture series in mathematics and its applications. Oxford University Press, Oxford, New York, 2004.
[71] L. Henderson. Crimes of Persuasion : Schemes, Scams, Frauds : how Con Artists Will
Steal Your Savings and Inheritance Through Telemarketing Fraud, Investment Schemes
and Consumer Scams. Coyoto Ridge Press, 2003.
[72] P. Höfner, R. Khedri, and B. Möller. Feature algebra. In Proceedings of the 14th international conference on Formal Methods, FM’06, pages 300–315, Berlin, Heidelberg,
2006. Springer-Verlag.
[73] P. Höfner, R. Khédri, and B. Möller. An algebra of product families. Software and
System Modeling, 10(2) :161–182, 2011.
[74] P. Höfner, R. Khedri, and B. Möller. Feature algebra. In Jayadev Misra, Tobias Nipkow,
and Emil Sekerinski, editors, FM 2006 : Formal Methods, volume 4085 of Lecture Notes
in Computer Science, pages 300–315. 2006.
[75] I. Idris, A. Selamat, N. Thanh Nguyen, S. Omatu, O. Krejcar, K. Kuca, and M. Penhaker.
A combined negative selection algorithm–particle swarm optimization for an email spam
detection system. Engineering Applications of Artificial Intelligence, 39 :33 – 44, 2015.
[76] J. Iedemska, G. Stringhini, R.A. Kemmerer, C. Kruegel, and G. Vigna. The tricks of
the trade : What makes spam campaigns successful ? In 35. IEEE Security and Privacy
Workshops, SPW 2014, San Jose, CA, USA, May 17-18, 2014, pages 77–83, 2014.
[77] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering : A review. ACM Comput.
Surv., 31(3) :264–323, 1999.
[78] J.P. John, A. Moshchuk, S.D. Gribble, and A. Krishnamurthy. Studying spamming
botnets using botlab. In Proceedings of the 6th USENIX symposium on Networked
systems design and implementation, NSDI’09, pages 291–306, Berkeley, CA, USA, 2009.
USENIX Association.
[79] J.P. John, A. Moshchuk, S.D. Gribble, and A. Krishnamurthy. Studying spamming
botnets using botlab. In Proceedings of the 6th USENIX symposium on Networked
systems design and implementation, NSDI09, pages 291–306, Berkeley, CA, USA, 2009.
USENIX Association.
[80] I. Kanaris, K. Kanaris, H. Houvardas, and E. Stamatatos. Words versus Character nGrams for Anti-Spam Filtering. International Journal on Artificial Intelligence Tools,
16 :1047–1067, 2007.
[81] K. Kang, S. Cohen, J. Hess, W. Novak, and A. Peterson. Feature-oriented domain
analysis (foda) feasibility study, technical report, 1990.
[82] K. C. Kang, S. Kim, J. Lee, K. Kim, E. Shin, and M. Huh. Form : A feature-oriented
reuse method with domain-specific reference architectures. Ann. Softw. Eng., 5 :143–168,
January 1998.
[83] C. Kanich, C. Kreibich, K. Levchenko, B. Enright, G. M. Voelker, V. Paxson, and S. Savage. Spamalytics : An empirical analysis of spam marketing conversion. In Proceedings
of the 15th ACM Conference on Computer and Communications Security, CCS ’08,
pages 3–14, New York, NY, USA, 2008. ACM.
[84] C. Kanich, N. Weavery, D. McCoy, T. Halvorson, C. Kreibichy, K. Levchenko, V. Paxson,
G.M. Voelker, and S. Savage. Show me the money : Characterizing spam-advertised
revenue. In Proceedings of the 20th USENIX Conference on Security, SEC’11, Berkeley,
CA, USA, 2011. USENIX Association.
[85] R. Kerber. Chimerge : Discretization of numeric attributes. In Proceedings of the Tenth
National Conference on Artificial Intelligence, pages 123–128, 1992.
[86] J. Kleinberg. An impossibility theorem for clustering. In Neural Information Processing
Systems Foundation, Inc., pages 446–453. MIT Press, 2002.
[87] C. Kreibich, C. Kanich, K. Levchenko, B. Enright, G.M. Voelker, V. Paxson, and S. Savage. Spamcraft : an inside look at spam campaign orchestration. In Proceedings of the
2nd USENIX conference on Large-scale exploits and emergent threats : botnets, spyware,
worms, and more, LEET09, 2009.
[88] C. Kreibich, C. Kanich, K. Levchenko, B. Enright, G.M. Voelker, V. Paxson, and S. Savage. Spamcraft : An inside look at spam campaign orchestration. In Proceedings of
the 2Nd USENIX Conference on Large-scale Exploits and Emergent Threats : Botnets,
Spyware, Worms, and More, LEET’09, Berkeley, CA, USA, 2009. USENIX Association.
[89] C.C Lai and M.C Tsai. An empirical performance comparison of machine learning
methods for spam e-mail categorization. In Hybrid Intelligent Systems, 2004. HIS ’04.
Fourth International Conference on, pages 44–48, 2004.
[90] C. Laorden, X. Ugarte-Pedrero, I. Santos, B. Sanz, J. Nieves, and P.G. Bringas. Study on
the effectiveness of anomaly detection for spam filtering. Information Sciences, 277 :421
– 444, 2014.
[91] N. Leontiadis. Measuring and analyzing search-redirection attacks in the illicit online
prescription drug trade. In Proceedings of USENIX Security 2011, 2011.
[92] F. Li and M.H. Hsieh. An empirical study of clustering behavior of spammers and groupbased anti-spam strategies. In CEAS 2006 Third Conference on Email and AntiSpam,
pages 27–28, 2006.
[93] H. Li. Minimum entropy clustering and applications to gene expression analysis. In In
Proceedings of IEEE Computational Systems Bioinformatics Conference, pages 142–151,
2004.
[94] X. Li and Z. Fang. Parallel clustering algorithms. Parallel Computing, 11(3) :275 – 290,
1989.
[95] J. Liu, D. Batory, and C. Lengauer. Feature oriented refactoring of legacy applications.
In Proceedings of the 28th International Conference on Software Engineering, ICSE ’06,
pages 112–121, New York, NY, USA, 2006. ACM.
[96] J. Liu, Y. Xiao, K. Ghaboosi, H. Deng, and J. Zhang. Botnet : Classification, attacks, detection, tracing, and preventive measures. EURASIP J. Wirel. Commun. Netw.,
2009 :9 :1–9 :11, February 2009.
[97] R. Lopez-Herrejon, D. Batory, and C. Lengauer. A disciplined approach to aspect composition. In Proceedings of the 2006 ACM SIGPLAN Symposium on Partial Evaluation
and Semantics-based Program Manipulation, PEPM ’06, pages 68–77, New York, NY,
USA, 2006. ACM.
[98] C. D. Manning, R. Prabhakar, and H. Schütze. Introduction to Information Retrieval.
Cambridge University Press, New York, NY, USA, 2008.
[99] S. Martin, B. Nelson, A. Sewani, K. Chen, and A. D. Joseph. Analyzing behavioral
features for email classification. In CEAS, 2005.
[100] L. McAfee. Mcafee threats report : 2015. In www.mcafee.com, 2015.
[101] M. McAfee Avert Labs. Mcafee threats report : Third quarter 2013. 2013.
[102] M. Meilǎ. Comparing clusterings : An axiomatic view. In Proceedings of the 22Nd
International Conference on Machine Learning, ICML ’05, pages 577–584, New York,
NY, USA, 2005. ACM.
[103] P. Meyer and A.L. Olteanu. Formalizing and solving the problem of clustering in
{MCDA}. European Journal of Operational Research, 227(3) :494 – 502, 2013.
[104] M. E. J. Newman, S. Forrest, and J. Balthrop. Email networks and the spread of
computer viruses. Phys. Rev. E, 66 :035101, 2002.
[105] B. Panda, J. S. Herbach, S. Basu, and R. J. Bayardo. Planet : Massively parallel learning
of tree ensembles with mapreduce. Proc. VLDB Endow., 2(2) :1426–1437, August 2009.
[106] A. Pathak, F. Qian, Y. C. Hu, Z. M. Mao, and S. Ranjan. Botnet spam campaigns can
be long lasting : Evidence, implications, and analysis. SIGMETRICS Perform. Eval.
Rev., 37(1) :13–24, June 2009.
[107] A. Pitsillidis, K. Levchenko, C. Kreibich, C. Kanich, G.M. Voelker, V. Paxson, N. Weaver, and S. Savage. Botnet judo : Fighting spam with itself. 2010.
[108] C. Pu and S. Webb. Observed trends in spam construction techniques : A case study of
spam evolution. In CEAS, pages 104–112, 2006.
[109] J. R. Quinlan. Induction of decision trees. Mach. Learn, pages 81–106, 1986.
[110] S. Radicati. Email statistics report 2013-2017. In www.radiocati.com, 2013.
[111] A. Ramachandran and N. Feamster. Understanding the network-level behavior of spammers. ACM SIGCOMM Computer Communication Review, 36(4) :291–302, 2006.
[112] J. M. Rao and D. H. Reiley. The economics of spam. The Journal of Economic Perspectives, 26(3) :pp. 87–110, 2012.
[113] J.M. Rao and D.H. Reiley. On the spam campaign trail. In The Economics of Spam,
pages 87–110. Journal of Economic Perspectives, Volume 26, Number 3, 2012.
[114] Technical Report. Commtouch technical report. In www.commtouch.com, 2015.
[115] S. Robak and A. Pieczyński. Employment of fuzzy logic in feature diagrams to model
variability in software families. J. Integr. Des. Process Sci., 7(3) :79–94, August 2003.
[116] R. A. Rodríguez-Gómez, G. Maciá-Fernández, and P. García-Teodoro. Survey and taxonomy of botnet research through life-cycle. ACM Comput. Surv., 45(4) :45 :1–45 :33,
August 2013.
[117] L. Rokach. A survey of clustering algorithms. In O. Maimon and L. Rokach, editors,
Data Mining and Knowledge Discovery Handbook, pages 269–298. 2010.
[118] Peter J. Rousseeuw. Silhouettes : A graphical aid to the interpretation and validation
of cluster analysis. Journal of Computational and Applied Mathematics, 20(0) :53 – 65,
1987.
[119] T. S. Guzella and W. M. Caminhas. A review of machine learning approaches to spam
filtering. Expert Systems with Applications, 36(7) :10206 – 10222, 2009.
[120] S. Salvador and P. Chan. Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms. In Proceedings of the 16th IEEE International
Conference on Tools with Artificial Intelligence, ICTAI ’04, pages 576–584, Washington,
DC, USA, 2004. IEEE Computer Society.
[121] A. A. Abu Samra and O. A. Ghanem. Analysis of clustering technique in android
malware detection. In Proceedings of the 2013 Seventh International Conference on
Innovative Mobile and Internet Services in Ubiquitous Computing, IMIS ’13, pages 729–
733, Washington, DC, USA, 2013. IEEE Computer Society.
[122] S. S.C. Silva, R. M.P. Silva, R. C.G. Pinto, and R. M. Salles. Botnets : A survey.
Computer Networks, 57(2) :378 – 403, 2013. Botnet Activity : Analysis, Detection and
Shutdown.
[123] A. K. Seewald. An evaluation of naive bayes variants in content-based learning for spam
filtering. Intell. Data Anal., 11(5) :497–524, October 2007.
[124] S. Shalev-Shwartz and B.D. Shai. Understanding machine learning : From theory to
algorithms. In Cambridge University Press, 2014.
[125] C. E. Shannon. A mathematical theory of communication. SIGMOBILE Mob. Comput.
Commun. Rev., 5(1) :3–55, January 2001.
[126] M. Sheikhalishahi, M Mejri, and N. Tawbi. Clustering spam emails into campaigns. In
Olivier Camp, Edgar R. Weippl, Christophe Bidan, and Esma Aïmeur, editors, ICISSP
2015 - Proceedings of the 1st International Conference on Information Systems Security
and Privacy, ESEO, Angers, Loire Valley, France, 9-11 February, 2015., pages 90–97,
February 2015.
[127] M. Sheikhalishahi, M Mejri, and N. Tawbi. On the abstraction of a categorical clustering
algorithm. In Machine Learning and Data Mining in Pattern Recognition - 12th International Conference, MLDM 2016, New York, NY, USA, July 16-21, 2016, Proceedings,
pages 659–675, 2016.
[128] M. Sheikhalishahi, A. Saracino, M Mejri, N. Tawbi, and F. Martinelli. Digitalwaste
sorting : A goal-based, self-learning approach to label spam email campaigns. In Sara
Foresti, editor, Security and Trust Management - 11th International Workshop, STM
2015, Vienna, Austria, September 21-22, 2015, Proceedings, volume 9331 of Lecture
Notes in Computer Science, pages 3–19. Springer, 2015.
[129] M. Sheikhalishahi, A. Saracino, M Mejri, N. Tawbi, and F. Martinelli. Fast and effective
clustering of spam emails based on structural similarity. In Foundations and Practice of
Security - 8th International Symposium, FPS 2015, Clermont-Ferrand, France, October
26-28, 2015, Revised Selected Papers, pages 195–211, 2015.
[130] J. Song, D. Inque, M. Eto, H.C. Kim, and K. Nakao. An empirical study of spam :
Analyzing spam sending systems and malicious web servers. In Proceedings of the 2010
10th IEEE/IPSJ International Symposium on Applications and the Internet, SAINT ’10,
pages 257–260, Washington, DC, USA, 2010. IEEE Computer Society.
[131] J. Song, D. Inque, M. Eto, H.C. Kim, and K. Nakao. A heuristic-based feature selection
method for clustering spam emails. In Proceedings of the 17th international conference
on Neural information processing : theory and algorithms - Volume Part I, ICONIP’10,
pages 290–297, Berlin, Heidelberg, 2010. Springer-Verlag.
[132] J. Song, D. Inque, M. Eto, H.C. Kim, and K. Nakao. O-means : An optimized clustering
method for analyzing spam based attacks. In IEICE Transactions on Fundamentals of
Electronics Communications and Computer Sciences, volume 94, pages 245–254, 2011.
[133] B. Stone-Gross, T. Holz, G. Stringhini, and G. Vigna. The underground economy of
spam : A botmaster’s perspective of coordinating large-scale spam campaigns. In Proceedings of the 4th USENIX Conference on Large-scale Exploits and Emergent Threats,
LEET’11, Berkeley, CA, USA, 2011. USENIX Association.
[134] G. Stringhini, O. Hohlfeld, C. Kruegel, and G. Vigna. The harvester, the botmaster, and
the spammer : On the relations between the different actors in the spam landscape. In
Proceedings of the 9th ACM Symposium on Information, Computer and Communications
Security, ASIA CCS ’14, pages 353–364, New York, NY, USA, 2014. ACM.
[135] D. Talia. Parallelism in knowledge discovery techniques. In Proceedings of the 6th
International Conference on Applied Parallel Computing Advanced Scientific Computing,
PARA ’02, pages 127–138, London, UK, 2002.
[136] K. Tillman. How many internet connections are in the world? Right now. In www.blogs.cisco.com, 2013.
[137] A. Topchy, A.K. Jain, and W. Punch. Combining multiple weak clusterings. In ICDM
third Proceedings of the Third IEEE International Conference on Data Mining, pages
331–338, Nov 2003.
[138] K. Tretyakov. Machine learning techniques in spam filtering. 2004.
[139] K. Tretyakov. Machine learning techniques in spam filtering. In Data Mining Problemoriented Seminar, MTAT, volume 3, pages 60–79. Citeseer, 2004.
[140] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features.
In Proceedings of the CVPR IEEE Computer Society Conference on Computer Vision
and Pattern Recognition, volume 1, pages I–511–I–518 vol.1, 2001.
[141] D. Wang, D. Irani, and C. Pu. A study on evolution of email spam over fifteen years. In
Proceedings of the 9th International Conference Conference onCollaborative Computing :
Networking, Applications and Worksharing (Collaboratecom), pages 1–10, Oct 2013.
[142] X. Wang, S. Chen, and S. Jajodia. Network flow watermarking attack on low-latency
anonymous communication systems. In Security and Privacy, 2007. SP ’07. IEEE Symposium on, pages 116–130, 2007.
[143] X.L. Wang and I. Cloete. Learning to classify email : a survey. In Proceedings of 2005
International Conference on Machine Learning and Cybernetics, volume 9, pages 5716–
5719 Vol. 9, Aug 2005.
[144] C. Wei, A. Sprague, G. Warner, and A. Skjellum. Mining spam email to identify common
origins for forensic application. In Proceedings of the 2008 ACM symposium on Applied
computing, SAC ’08, pages 1433–1437, New York, NY, USA, 2008. ACM.
[145] F. Weng, Q. Jiang, L. Shi, and N. Wu. An intrusion detection system based on the clustering ensemble. In Anti-counterfeiting, Security, Identification, 2007 IEEE International
Workshop on, pages 121–124, April 2007.
[146] Y. Xie, F. Yu, K. Achan, R. Panigrahy, G. Hulten, and I Osipkov. Spamming botnets :
Signatures and characteristics. 38(4) :171–182, 2008.
[147] R. Xu and D. Wunsch. Survey of clustering algorithms. Proceedings of the IEEE Transactions on Neural Networks, 16(3) :645–678, May 2005.
[148] Y. Yang, X. Guan, and J. You. Clope : A fast and effective clustering algorithm for
transactional data. In Proceedings of the Eighth ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, KDD ’02, pages 682–687, New York, NY,
USA, 2002. ACM.
[149] K. Yoda and H. Etoh. Finding a connection chain for tracing intruders. In Proceedings
of the 6th European Symposium on Research in Computer Security, ESORICS ’00, pages
191–205, London, UK, 2000. Springer-Verlag.
[150] C. Zhang, W.B. Chen, X. Chen, and G. Warner. Revealing common sources of image
spam by unsupervised clustering with visual features. In Proceedings of the 2009 ACM
symposium on Applied Computing, SAC ’09, pages 891–892, New York, NY, USA, 2009.
ACM.
[151] L. Zhang, J. Zhu, and T. Yao. An evaluation of statistical spam filtering techniques. ACM Transactions on Asian Language Information Processing (TALIP), 3(4):243–269, December 2004.
[152] Yao Zhao, Yinglian Xie, Fang Yu, Qifa Ke, Yuan Yu, Yan Chen, and Eliot Gillum.
Botgraph : Large scale spamming botnet detection. In Proceedings of the 6th USENIX
Symposium on Networked Systems Design and Implementation, NSDI’09, pages 321–334,
Berkeley, CA, USA, 2009. USENIX Association.
[153] L. Zhuang, J. Dunagan, D.R. Simon, H. Wang, and J.D. Tygar. Characterizing botnets
from email spam records. In Proceedings of the 1st Usenix Workshop on Large-Scale
Exploits and Emergent Threats, LEET’08, pages 2 :1–2 :9, Berkeley, CA, USA, 2008.
USENIX Association.