Global Journal of Computer Science and Technology
Software & Data Engineering
Volume 13 Issue 5 Version 1.0 Year 2013
Type: Double Blind Peer Reviewed International Research Journal
Publisher: Global Journals Inc. (USA)
Online ISSN: 0975-4172 & Print ISSN: 0975-4350
Efficient Spam Classification by Appropriate Feature Selection
By Prajakta Ozarkar & Dr. Manasi Patwardhan
Vishwakarma Institute of Technology, India
Abstract - Spam is a key problem in electronic communication, including large-scale email systems and the growing number of blogs. Currently a lot of research work is performed on automatic detection of spam emails using classification techniques such as SVM, NB, MLP, KNN, ID3, J48, Random Tree, etc. A spam dataset can have a large number of training instances. Based on this fact, we have made use of the Random Forest and Partial Decision Trees (PART) algorithms to classify spam vs. non-spam emails. These algorithms outperformed the previously implemented algorithms in terms of accuracy and time complexity. As a preprocessing step we have used feature selection methods such as Chi-square, Information gain, Gain ratio, Symmetrical uncertainty, Relief, OneR and Correlation. This allowed us to select a subset of relevant, non-redundant and most contributing features, with the added benefit of improved accuracy and reduced time complexity.
Keywords : feature selection, preprocessing, random forest, PART.
GJCST-C Classification : H.4.3
© 2013. Prajakta Ozarkar & Dr. Manasi Patwardhan. This is a research/review paper, distributed under the terms of the Creative Commons Attribution-Noncommercial 3.0 Unported License (http://creativecommons.org/licenses/by-nc/3.0/), permitting all noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
I. Introduction
In this paper we have studied previous approaches used for classifying spam and non-spam emails with distinct classification algorithms. We have also studied the distinct features extracted for classifier training and the feature selection algorithms applied to discard irrelevant features and select the most contributing ones. After studying the current feature selection and classification approaches, we have applied two new classification techniques, viz. Random Forests and Partial Decision Trees, along with distinct feature selection algorithms.
R. Parimala et al. [1] present a new feature selection (FS) technique guided by the FSelector package. They have used nine feature selection techniques (Correlation-based feature selection, Chi-square, Entropy, Information Gain, Gain Ratio, Mutual Information, Symmetrical Uncertainty, OneR and Relief) and five classification algorithms (Linear Discriminant Analysis, Random Forest, Rpart, Naïve Bayes and Support Vector Machine) on the spambase dataset. In their evaluation, the results show that the filter methods CFS, Chi-squared, GR, Relief, SU, IG and OneR enable the classifiers to achieve the highest increase in classification accuracy. They conclude that the implemented FS can improve the accuracy of Support Vector Machine classifiers.
Author α : Prajakta Ozarkar, Student, Vishwakarma Institute of Technology, Pune, Maharashtra, India. E-mail : prajaktaozarkar00@gmail.com
Author σ : Manasi Patwardhan, Professor, Vishwakarma Institute of Technology, Pune, Maharashtra, India. E-mail : manasi.patwardhan@vit.edu
In the paper by R. Kishore Kumar et al. [2], the spam dataset is analyzed using the Tanagra data mining tool. Initially, feature construction and feature selection are done to extract the relevant features by using Fisher filtering, Relief, Runs filtering and StepDisc. Then classification algorithms such as C4.5, C-PLS, C-RT, CS-CRT, CS-MC4, CS-SVC, ID3, K-NN, LDA, Log-Reg TRIRLS, Multilayer Perceptron, Multilogical Logistic Regression, Naïve Bayes Continuous, PLS-DA, PLS-LDA, Rnd Tree and SVM are applied over the spambase dataset, and cross validation is done for each of these classifiers. They conclude that the Fisher filtering and Runs filtering feature selection algorithms perform better for many classifiers. The Rnd Tree classification algorithm with the relevant features extracted by Fisher filtering produces more than 99% accuracy in spam detection.
W. A. Awad et al. [3] review the machine learning methods Bayesian classification, k-NN, ANNs, SVMs, Artificial immune system and Rough sets on the SpamAssassin spam corpus. They conclude that the Naïve Bayes method has the highest precision among the six algorithms, while k-nearest neighbor has the worst precision percentage. Also, the rough sets method has a very competitive percentage.
V. Christina et al. [4] employ supervised machine learning techniques, namely the C4.5 Decision tree classifier, Multilayer Perceptron and Naïve Bayes classifier. Five features of an e-mail, all (A), header (H), body (B), subject (S), and body with subject (B+S), are used to evaluate the performance of four machine learning algorithms. The training dataset, a spam and legitimate message corpus, is generated from the mails that they received from their institute mail server over a period of six months. They conclude that the Multilayer Perceptron classifier outperforms the other classifiers and that its false positive rate is also very low compared to the other algorithms.
Rafiqul Islam et al. [5] have presented an effective and efficient email classification technique based on a data filtering method. In their testing they have introduced an innovative filtering technique using an instance selection method (ISM) to remove pointless data instances from the training model and then classify the test data. In their model, tokenization and domain-specific feature selection methods are used for feature extraction. Behavioral features are also included to improve performance, especially to reduce false positive (FP) problems. The behavioral features include the frequency of sending/receiving emails, email attachment, type of attachment, size of attachment and length of the email. In their experiment, they have tested five base classifiers (Naive Bayes, SVM, IB1, Decision Table and Random Forest) on 6 different datasets. They have also tested adaptive boosting (AdaBoostM1) as a meta-classifier on top of the base classifiers. They have achieved overall classification accuracy above 97%.
A comparative analysis is performed by Ms. D. Karthika Renuka et al. [6] of the classification techniques MLP, J48 and Naïve Bayesian for classifying spam messages from e-mail using the WEKA tool. The dataset, gathered from the UCI repository, had 2788 legitimate and 1813 spam emails received during a period of several months. Using this dataset as a training dataset, models are built for the classification algorithms. The study reveals that the same classifier performed dissimilarly when run on the same dataset but using different software tools. From all perspectives, MLP is the top performer in all cases and can thus be deemed consistent.
The following table summarizes the previous classification approaches listed above and compares them in terms of the accuracy (%) achieved with the application of a specific feature selection algorithm.
Table 1 : Comparison of previous approaches of spam detection
Reference | Classifier used and features % | Feature selection | Acc (%)
R. Parimala et al. [1] | SVM (100%) | - | 93
 | SVM (16%) | CFS | 91.44
 | SVM (70%) | Chi | 93.00
 | SVM (70%) | IG | 93.00
 | SVM (70%) | GR | 93.39
 | SVM (70%) | SU | 93.33
 | SVM (70%) | OneR | 92.65
 | SVM (70%) | Relief | 93.15
 | SVM (32%) | LDA | 91.90
 | SVM (12%) | Rpart | 90.51
 | SVM (16%) | SVM | 89.95
 | SVM (21%) | RF | 91.23
 | SVM (7%) | NB | 80.00
R. Kishore Kumar et al. [2] | C-PLS | Fisher | 99.8976
 | C-RT | Fisher | 99.9465
 | CS-CRT | Fisher | 99.9465
 | CS-MC4 | Fisher | 99.9415
 | CS-SVC | Fisher | 99.9685
 | ID3 | Fisher | 99.9137
 | KNN | Fisher | 99.9391
 | LDA | Fisher | 99.8861
 | LogReg TRI | Fisher | 99.8552
 | MLP | Fisher | 99.9459
 | Multilogical LR | Fisher | 99.9311
 | NBC | Fisher | 99.8865
 | PLS-DA | Fisher | 99.8752
 | PLS-LDA | Fisher | 99.8757
 | Rnd Tree | Fisher | 99.9911
 | SVM | Fisher | 99.9070
 | C4.5 | Relief | 99.9487
 | C-PLS | Relief | 99.8537
 | C-RT | Relief | 99.9261
 | CS-CRT | Relief | 99.9261
 | CS-MC4 | Relief | 99.9324
 | CS-SVC | Relief | 99.8794
 | ID3 | Relief | 99.895
 | KNN | Relief | 99.9176
 | LDA | Relief | 99.8481
 | LogReg TRI | Relief | 99.8179
 | MLP | Relief | 99.9185
 | Multilogical LR | Relief | 99.8883
 | NBC | Relief | 99.8587
 | PLS-DA | Relief | 99.8474
 | PLS-LDA | Relief | 99.8476
 | Rnd Tree | Relief | 99.9676
 | SVM | Relief | 99.8639
 | C4.5 | Runs | 99.9633
 | C-PLS | Runs | 99.9102
 | C-RT | Runs | 99.9404
 | CS-CRT | Runs | 99.9404
 | CS-MC4 | Runs | 99.9615
 | CS-SVC | Runs | 99.9233
 | ID3 | Runs | 99.9137
 | KNN | Runs | 99.9404
 | LDA | Runs | 99.8887
 | MLP | Runs | 99.9607
 | LogReg TRI | Runs | 99.8611
 | Multilogical LR | Runs | 99.9313
 | NBC | Runs | 99.8874
 | PLS-DA | Runs | 99.8879
 | PLS-LDA | Runs | 99.8879
 | Rnd Tree | Runs | 99.9883
 | SVM | Runs | 99.9076
 | C4.5 | StepDisc | 99.9633
 | C-PLS | StepDisc | 99.9081
 | C-RT | StepDisc | 99.9341
 | CS-CRT | StepDisc | 99.9341
 | CS-MC4 | StepDisc | 99.9604
 | CS-SVC | StepDisc | 99.9218
 | ID3 | StepDisc | 99.9105
 | KNN | StepDisc | 99.935
 | LDA | StepDisc | 99.8881
 | LogReg TRI | StepDisc | 99.8587
 | MLP | StepDisc | 99.9481
 | Multilogical LR | StepDisc | 99.9294
 | NBC | StepDisc | 99.8829
 | PLS-DA | StepDisc | 99.8826
 | PLS-LDA | StepDisc | 99.8829
 | Rnd Tree | StepDisc | 99.99
 | SVM | StepDisc | 99.905
W. A. Awad et al. [3] | NBC | - | 99.46
 | SVM | - | 96.9
 | KNN | - | 96.2
 | ANN | - | 96.83
 | AIS | - | 96.23
 | Rough Sets | - | 97.42
V. Christina et al. [4] | NBC | - | 98.6
 | J48 | - | 96.6
 | MLP | - | 99.3
Rafiqul Islam et al. [5] | NB | - | 92.3
 | SMO | - | 96.4
 | IB1 | - | 95.8
 | DT | - | 95.9
 | RF | - | 96.1
Ms. D. Karthika Renuka et al. [6] | MLP | - | 93
 | J48 | - | 92
 | NBC | - | 89

II. Proposed Work
After a detailed review of the existing techniques used for spam detection, this section illustrates the methodology and techniques we used for spam mail detection.
Figure 1 shows the process we have used for spam mail identification and how it is used in conjunction with a machine learning scheme. Feature ranking techniques such as Chi-square, Information gain, Gain ratio, Symmetrical uncertainty, Relief, OneR and Correlation are applied to a copy of the training data. After feature selection, the subset with the highest merit is used to reduce the dimensionality of both the original training data and the testing data. Both reduced datasets may then be passed to a machine learning scheme for training and testing. Results are obtained by using the Random Forest and PART classification techniques.
Figure 1 : Stages of Spam Email Classification
In the following subsections we discuss the basic concepts related to our work, including a brief background on feature ranking techniques, classification techniques and results.

III. Data Set
The dataset used for our experiment is spambase [13]. The last column of 'spambase.data' denotes whether the e-mail was considered spam (1) or not (0). Most of the attributes indicate the frequency of spam related term occurrences. The first 48 attributes (1–48) give tf-idf (term frequency and inverse document frequency) values for spam related words, whereas the next 6 attributes (49–54) provide tf-idf values for spam related characters. The run-length attributes (55–57) measure the length of sequences of consecutive capital letters: capital_run_length_average, capital_run_length_longest and capital_run_length_total. Thus, our dataset has in total 57 attributes serving as input features for spam detection, and the last attribute represents the class (spam/non-spam).
We have also used one public dataset, Enron [20]. The "preprocessed" subdirectory contains the messages in the preprocessed format. Each message is in a separate text file. The body of an email contains the actual information. This information needs to be extracted by means of preprocessing before running a filter process. The purpose of preprocessing is to transform the messages into a uniform format that can be understood by the learning algorithm. The steps involved in preprocessing are:
1. Feature extraction (Tokenization): Extracting features from e-mail into a vector space.
2. Stemming: Stemming is a process for removing the commoner morphological and inflexional endings from words in English.
3. Stop word removal: Removal of non-informative words.
4. Noise removal: Removal of obscure text or symbols from features.
5. Representation: tf-idf is a statistical measure used to calculate how significant a word is to a document in a feature corpus. Word frequency is established by term frequency (tf): the number of times the word appears in the message yields the significance of the word to the document. The term frequency is then multiplied by the inverse document frequency (idf), which measures the frequency of the word occurring in all messages.

IV. Feature Ranking and Subset Selection
From the feature vector defined above (57 input features plus the class attribute, 58 attributes in total), we use feature ranking and selection algorithms to select a subset of features. We rank the given set of features using the following distinct approaches.
a) Chi-square
Chi-squared hypothesis tests may be performed on contingency tables in order to decide whether or not effects are present. Effects in a contingency table are defined as relationships between the row and column variables; that is, are the levels of the row variable differentially distributed over levels of the column variables. Significance in this hypothesis test means that interpretation of the cell frequencies is warranted. Non-significance means that any differences in cell frequencies could be explained by chance. Hypothesis tests on contingency tables are based on a statistic called Chi-square [8]:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where O is the observed cell frequency and E is the expected cell frequency.
b) Information Gain
Information gain is the expected reduction in entropy caused by partitioning the examples according to a given attribute. Information gain is a symmetrical measure; that is, the amount of information gained about Y after observing X is equal to the amount of information gained about X after observing Y. The entropy of Y is given by [9]

$$H(Y) = -\sum_{y \in Y} p(y) \log_2 p(y)$$

If the observed values of Y in the training data are partitioned according to the values of a second feature X, and the entropy of Y with respect to the partitions induced by X is less than the entropy of Y prior to partitioning, then there is a relationship between features Y and X. The following equation gives the entropy of Y after observing X:

$$H(Y \mid X) = -\sum_{x \in X} p(x) \sum_{y \in Y} p(y \mid x) \log_2 p(y \mid x)$$

The amount by which the entropy of Y decreases reflects additional information about Y provided by X and is called the information gain or, alternatively, mutual information [9]. Information gain is given by

$$\begin{aligned} \text{Gain} &= H(Y) - H(Y \mid X) \\ &= H(X) - H(X \mid Y) \\ &= H(Y) + H(X) - H(X, Y) \end{aligned}$$

c) Gain Ratio
The various selection criteria have been
compared empirically in a series of experiments. When
all attributes are binary, the gain ratio criterion has been
found to give considerably smaller decision trees. When
the task includes attributes with large numbers of
values, the subset criterion gives smaller decision trees
that also have better predictive performance, but can
require much more computation. However, when these
many-valued attributes are augmented by redundant
attributes which contain the same information at a lower
level of detail, the gain ratio criterion gives decision
trees with the greatest predictive accuracy. All in all, it
suggests that the gain ratio criterion does pick a good
attribute for the root of the tree [12].

$$\text{Gain Ratio} = \frac{H(Y) + H(X) - H(Y, X)}{H(X)}$$

d) Symmetrical Uncertainty
Information gain is a symmetrical measure that
is, the amount of information gained about Y after
observing X is equal to the amount of information
gained about X after observing Y. Symmetry is a
desirable property for a measure of feature-feature intercorrelation to have. Unfortunately, information gain is biased in favor of features with more values. Symmetrical uncertainty compensates for information gain's bias toward attributes with more values and normalizes its value to the range [0, 1] [9]:

$$\text{Symmetrical Uncertainty Coeff} = 2.0 \times \frac{\text{Gain}}{H(Y) + H(X)}$$

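The entropy-based measures above (information gain, gain ratio and symmetrical uncertainty) can be illustrated with a short sketch. This is not the implementation used in the paper; it assumes discrete feature values, estimates probabilities from raw counts, and the function names and the toy data are invented for illustration.

```python
from collections import Counter
from math import log2


def entropy(values):
    """H(V) = -sum_v p(v) * log2 p(v), with p estimated from counts."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def information_gain(x, y):
    """Gain = H(Y) + H(X) - H(X, Y); symmetric in x and y."""
    return entropy(y) + entropy(x) - entropy(list(zip(x, y)))


def gain_ratio(x, y):
    """Gain normalized by H(X); defined as 0.0 when X is constant."""
    hx = entropy(x)
    return information_gain(x, y) / hx if hx > 0 else 0.0


def symmetrical_uncertainty(x, y):
    """SU = 2 * Gain / (H(Y) + H(X)), which lies in [0, 1]."""
    denom = entropy(y) + entropy(x)
    return 2.0 * information_gain(x, y) / denom if denom > 0 else 0.0


# Toy data (invented): a binarized feature and the spam label.
has_free = [1, 1, 1, 0, 0, 0, 1, 0]
is_spam = [1, 1, 1, 0, 0, 0, 0, 1]
print(information_gain(has_free, is_spam))
print(gain_ratio(has_free, is_spam))
print(symmetrical_uncertainty(has_free, is_spam))
```

Continuous attributes such as those in spambase would need to be discretized before these count-based estimates apply; how the paper's tooling does this is not stated here.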
e) Relief
Relief [10] is a feature weighting algorithm that is sensitive to feature interactions. Relief attempts to approximate the following difference of probabilities for the weight of a feature X [9]:

$$W_X = P(\text{different value of } X \mid \text{nearest instance of different class}) - P(\text{different value of } X \mid \text{nearest instance of same class})$$

By removing the context sensitivity provided by the "nearest instance" condition, attributes are treated as independent of one another:

$$\text{Relief}_X = P(\text{different value of } X \mid \text{different class}) - P(\text{different value of } X \mid \text{same class})$$

which can be reformulated as

$$\text{Relief}_X = \frac{\text{Gini}' \times \sum_{x \in X} p(x)^2}{\left(1 - \sum_{c \in C} p(c)^2\right) \sum_{c \in C} p(c)^2}$$

where C is the class variable and

$$\text{Gini}' = \left[\sum_{c \in C} p(c)\,(1 - p(c))\right] - \sum_{x \in X} \left(\frac{p(x)^2}{\sum_{x \in X} p(x)^2} \sum_{c \in C} p(c \mid x)\,(1 - p(c \mid x))\right)$$

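The nearest-instance formulation of the Relief weight can be sketched as follows. This is a minimal version of the basic Relief procedure for numeric features, written under assumptions not taken from the paper: differences are scaled by each feature's observed range, distance is the sum of scaled differences, and the function name `relief_weights` and the toy data are illustrative.

```python
import random


def relief_weights(X, y, n_samples=None, seed=0):
    """Estimate one Relief weight per feature for numeric data."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    # Scale each feature difference to [0, 1] using the observed range.
    spans = []
    for j in range(d):
        column = [row[j] for row in X]
        spans.append((max(column) - min(column)) or 1.0)

    def diff(j, a, b):
        return abs(a[j] - b[j]) / spans[j]

    def distance(a, b):
        return sum(diff(j, a, b) for j in range(d))

    m = n_samples or n
    weights = [0.0] * d
    for _ in range(m):
        i = rng.randrange(n)
        hits = [k for k in range(n) if k != i and y[k] == y[i]]
        misses = [k for k in range(n) if y[k] != y[i]]
        if not hits or not misses:
            continue  # no neighbour of one kind; skip this sample
        hit = min(hits, key=lambda k: distance(X[i], X[k]))
        miss = min(misses, key=lambda k: distance(X[i], X[k]))
        # Reward features that differ on the nearest miss and
        # penalize features that differ on the nearest hit.
        for j in range(d):
            weights[j] += (diff(j, X[i], X[miss]) - diff(j, X[i], X[hit])) / m
    return weights


# Toy data (invented): feature 0 determines the class, feature 1 is noise.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]
print(relief_weights(X, y))
```

Features would then be ranked by descending weight, so in the toy data feature 0 ranks above feature 1.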
f) OneR
Like other empirical learning methods, 1R [11] takes as input a set of examples, each with several attributes and a class. The aim is to infer a rule that predicts the class given the values of the attributes. The 1R algorithm chooses the most informative single attribute and bases the rule on this attribute alone. The basic idea is:
For each attribute a, form a rule as follows:
    For each value v from the domain of a,
        Select the set of instances where a has value v.
        Let c be the most frequent class in that set.
        Add the following clause to the rule for a:
            if a has value v then the class is c
    Calculate the classification accuracy of this rule.
Use the rule with the highest classification accuracy.
The algorithm assumes that the attributes are discrete. If not, then they must be discretized.
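The 1R procedure described above can be written as a short sketch. It assumes attributes that are already discrete, breaks ties by first occurrence, and uses an invented toy dataset; the function name `one_r` is illustrative and not part of the paper's implementation.

```python
from collections import Counter, defaultdict


def one_r(rows, labels):
    """Return (attribute_index, rule, accuracy) for the best single attribute.

    rule maps each observed attribute value to its most frequent class.
    """
    best = None
    for a in range(len(rows[0])):
        # Count class frequencies for every value of attribute a.
        by_value = defaultdict(Counter)
        for row, label in zip(rows, labels):
            by_value[row[a]][label] += 1
        rule = {v: counts.most_common(1)[0][0] for v, counts in by_value.items()}
        correct = sum(counts[rule[v]] for v, counts in by_value.items())
        accuracy = correct / len(rows)
        if best is None or accuracy > best[2]:
            best = (a, rule, accuracy)
    return best


# Toy data (invented): (contains "free", capital-run bucket) -> class.
rows = [("yes", "low"), ("yes", "high"), ("no", "low"),
        ("no", "high"), ("yes", "low"), ("no", "low")]
labels = ["spam", "spam", "ham", "ham", "spam", "spam"]
attribute, rule, accuracy = one_r(rows, labels)
print(attribute, rule, accuracy)
```

When 1R is used for feature ranking rather than classification, the per-attribute accuracies computed in the loop serve as the ranking scores.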
g) Correlation
Feature selection for classification tasks in machine learning can be accomplished on the basis of correlation between features, and such a feature selection procedure can be beneficial to common machine learning algorithms [9]. Features are relevant if their values vary systematically with category membership. In other words, a feature is useful if it is correlated with or predictive of the class; otherwise it is irrelevant. A good feature subset is one that contains features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other. The acceptance of a feature will depend on the extent to which it predicts classes in areas of the instance space not already predicted by other features. The correlation-based feature selection (CFS) feature subset evaluation function is [9]:

$$M_S = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}}$$

where $M_S$ is the heuristic "merit" of a feature subset S containing k features, $\overline{r_{cf}}$ is the mean feature-class correlation, and $\overline{r_{ff}}$ is the average feature-feature intercorrelation.
Feature ranking further helps us to:
1. Remove irrelevant features, which might mislead the classifier, decreasing its interpretability and reducing generalization by increasing overfitting.
2. Remove redundant features, which provide no additional information beyond the other features and unnecessarily decrease the efficiency of the classifier.
3. Select high-rank features, which may not much affect precision and recall but reduce time complexity drastically. Selecting such high-rank features reduces the dimensionality of the feature space of the domain. It speeds up the classifier, thereby improving performance and increasing the comprehensibility of the classification result.
We have considered the top 87%, 77% and 70% of the features; a performance improvement is observed when 70% of the features are considered.

IV. Classification Method
Based on the assumption that the given dataset has a sufficient number of training instances, we have chosen the following two classification algorithms. The algorithms work well provided that the dataset is of good quality.
a) Random Forest
Random Forests [14] are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges almost surely (a.s.) to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Each tree is grown as follows:
1. If the number of cases in the training set is N, sample N cases at random, but with replacement, from the original data. This sample will be the training set for growing the tree.
2. If there are M input variables, a number m<<M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant during the forest growing.
3. Each tree is grown to the largest extent possible. There is no pruning.
Random Forest is an ensemble of trees. In our implementation of random forest we have selected a vector of 4 randomly selected features to build each tree in a forest of 10 random trees. Each tree grows to its maximum depth because the depth argument is set to zero, which indicates unlimited depth. Classification is done by using bagging and voting techniques. For example, a very small portion of the forest output is shown below:
Total Random forest Trees: 10
Numbers of random features: 4
Out of bag error: 0.1092391304347826
All the trees in the forest:
RandomTree
==========
word_freq_hpl < 0.07
| char_freq_$ < 0.03
| | word_freq_you < 0.12
| | | word_freq_hp < 0.02
| | | | char_freq_! < 0.01
| | | | | word_freq_3d < 9.87
| | | | | | word_freq_000 < 0.08
| | | | | | | char_freq_( < 0.04
| | | | | | | | word_freq_meeting < 0.85
| | | | | | | | | word_freq_remove < 2.27
| | | | | | | | | | word_freq_free < 6.47
| | | | | | | | | | | word_freq_will < 0.17
| | | | | | | | | | | | word_freq_pm < 0.42
| | | | | | | | | | | | | word_freq_all < 0.21
| | | | | | | | | | | | | | word_freq_mail < 2.96
| | | | | | | | | | | | | | | word_freq_re < 5.4
| | | | | | | | | | | | | | | | word_freq_technology < 1.43
| | | | | | | | | | | | | | | | | capital_run_length_total < 18.5
| | | | | | | | | | | | | | | | | | word_freq_re < 0.68
| | | | | | | | | | | | | | | | | | | word_freq_make < 1.39
| | | | | | | | | | | | | | | | | | | | capital_run_length_total < 10.5 : 0 (218/0)
| | | | | | | | | | | | | | | | | | | | capital_run_length_total >= 10.5
| | | | | | | | | | | | | | | | | | | | | word_freq_internet < 0.89
| | | | | | | | | | | | | | | | | | | | | | word_freq_people < 1.47
| | | | | | | | | | | | | | | | | | | | | | | word_freq_data < 3.7
| | | | | | | | | | | | | | | | | | | | | | | | word_freq_edu < 2.38
| | | | | | | | | | | | | | | | | | | | | | | | | char_freq_[ < 0.59
| | | | | | | | | | | | | | | | | | | | | | | | | | char_freq_; < 0.16
| | | | | | | | | | | | | | | | | | | | | | | | | | | capital_run_length_total < 11.5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | word_freq_credit < 9.09 : 0 (1/0)
This is the case when 100% of the features are selected for training the model; the root node of each tree changes according to the features selected.
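The randomization behind a random forest (bootstrap sampling of cases, a random subset of m features per node, and majority voting across trees) can be illustrated with a small sketch. It deliberately omits tree growing itself and is not the implementation used in the paper; the value of N and the function names are placeholders chosen for illustration, while M = 57, m = 4 and 10 trees mirror the settings quoted in the text.

```python
import random
from collections import Counter


def bootstrap_sample(n_cases, rng):
    """Step 1: draw N case indices at random, with replacement."""
    return [rng.randrange(n_cases) for _ in range(n_cases)]


def random_feature_subset(n_features, m, rng):
    """Step 2: pick m of the M input variables to consider at a node."""
    return rng.sample(range(n_features), m)


def majority_vote(predictions):
    """Combine the per-tree class predictions into the forest's prediction."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(42)
N, M, m, n_trees = 1000, 57, 4, 10  # N is an arbitrary placeholder
for t in range(n_trees):
    in_bag = bootstrap_sample(N, rng)
    # Cases never drawn are "out of bag" and can be used to estimate error.
    out_of_bag = set(range(N)) - set(in_bag)
    candidates = random_feature_subset(M, m, rng)
    print(t, len(out_of_bag), sorted(candidates))

print(majority_vote([1, 0, 1, 1, 0]))
```

In a full forest, each tree would be grown without pruning on its in-bag cases, drawing a fresh feature subset at every node, and the out-of-bag cases would yield an error estimate like the one reported in the output above.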
b) Partial Decision Tree
Rule learners are prominent representatives of supervised machine learning approaches. Basically, this type of learner tries to induce a set of rules for a collection of training instances. These rules are then applied to the test instances for classification purposes. Two well-known members of the family of rule learners are C4.5 and RIPPER. C4.5 [16], for instance, generates an unpruned decision tree and transforms this tree into a set of rules. For each path from the root node to a leaf, a rule is generated. Then, each rule is simplified separately, followed by a rule-ranking strategy. Finally, the algorithm deletes rules from the rule set as long as the rule set's error rate on the training instances decreases. RIPPER [17] implements a divide-and-conquer strategy for rule induction. Only one rule is generated at a time, and the instances from the training set covered by this rule are removed. It iteratively derives new rules for the remaining instances of the training set.
PART (Partial Decision Trees) adopts the divide-and-conquer strategy of RIPPER [17] and combines it with the decision tree approach of C4.5 [16]. PART generates a rule according to the divide-and-conquer strategy, removes all instances from the training collection that are covered by this rule, and proceeds recursively until no instance remains. To generate a single rule, PART builds a partial decision tree for the current set of instances and chooses the leaf with the largest coverage as the new rule. For example, the following shows how rules are formed in our implementation of PART; some of the rules are shown below:
Rule 1:
word_freq_remove > 0.0 AND
char_freq_! > 0.049 AND
word_freq_edu <= 0.06: 1 (Instances: 490 and Incorrect: 7)
After Rule 1, the next set of rules is formed excluding these 490 instances from the 4601 total instances of spambase.
Rule 2:
char_freq_$ > 0.058 AND
word_freq_hp <= 0.4 AND
capital_run_length_longest > 9.0 AND
word_freq_1999 <= 0.0 AND
word_freq_edu <= 0.08 AND
char_freq_! > 0.107: 1 (Instances: 334 and Incorrect: 2)
The next set of rules is formed on the remaining 3777 instances of spambase.
Rule 3:
word_freq_money <= 0.03 AND
word_freq_000 <= 0.25 AND
word_freq_remove <= 0.26 AND
word_freq_free <= 0.19 AND
word_freq_font <= 0.12 AND
char_freq_! <= 0.391 AND
char_freq_$ <= 0.172 AND
word_freq_george > 0.0: 0 (Instances: 553 and Incorrect: 0)
In total, 42 rules are formulated during training.
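An ordered rule list of this kind is applied by testing the rules in sequence and returning the class of the first rule whose conditions all hold. The sketch below encodes only Rule 1 and Rule 2 from the listing above; the dictionary-based email representation, the treatment of missing attributes as 0.0, and the fallback class are assumptions made for illustration, not details taken from the paper.

```python
# Each rule is (list of (attribute, operator, threshold) tests, predicted class).
RULES = [
    ([("word_freq_remove", ">", 0.0),
      ("char_freq_!", ">", 0.049),
      ("word_freq_edu", "<=", 0.06)], 1),
    ([("char_freq_$", ">", 0.058),
      ("word_freq_hp", "<=", 0.4),
      ("capital_run_length_longest", ">", 9.0),
      ("word_freq_1999", "<=", 0.0),
      ("word_freq_edu", "<=", 0.08),
      ("char_freq_!", ">", 0.107)], 1),
]
DEFAULT_CLASS = 0  # assumption: fall back to non-spam when no rule fires


def matches(tests, email):
    """Return True when every test of a rule holds for the email."""
    for attribute, operator, threshold in tests:
        value = email.get(attribute, 0.0)  # assumption: missing means 0.0
        holds = value > threshold if operator == ">" else value <= threshold
        if not holds:
            return False
    return True


def classify(email, rules=RULES, default=DEFAULT_CLASS):
    """Apply the rules in order; the first matching rule decides the class."""
    for tests, label in rules:
        if matches(tests, email):
            return label
    return default


email = {"word_freq_remove": 0.5, "char_freq_!": 0.2, "word_freq_edu": 0.0}
print(classify(email))
```

The order of the rules matters, because each rule was learned only on the instances left uncovered by the rules before it.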
V. Results
a) Spambase Results
The spambase dataset was taken from the UCI machine learning repository [13]. It contains 4601 instances and 58 attributes: 57 continuous attributes (1–57) and 1 nominal class label. The email spam classification has been implemented in Eclipse, which is considered by many to be the best Java development tool available. Feature ranking and feature selection are done using methods such as Chi-square, Information gain, Gain ratio, Symmetrical uncertainty, Relief, OneR and Correlation as a preprocessing step, so as to select a feature subset for building the learning model.
The classification algorithms are from the decision tree family, viz. Random Forest and Partial Decision Trees. Random forest is an effective tool in prediction. Because of the law of large numbers, random forests do not overfit. Random inputs and random features produce good results in classification, less so in regression. For larger data sets, it seems that significantly lower error rates are possible [14]. With PART, the feature space can be reduced by an order of magnitude while achieving similar classification results. For example, it takes about 2,000 features to achieve accuracies similar to those obtained with 149 PART features [15].
As a part of our implementation, we have divided the dataset into two parts: 80% of the dataset is used for training and 20% for testing. After the preprocessing step, the top 87%, 77% and 70% of the features are considered while building the training model and testing, because these subsets give a significant performance improvement. Prediction accuracy, correctly classified instances, incorrectly classified instances, confusion matrix and time complexity are used as performance measures of the system.
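The experimental setup described above (keeping the top fraction of ranked features and splitting the data 80/20) can be sketched as below. The rounding rule for the feature count and the use of a shuffled random split are assumptions, since the text does not specify them; the function names are illustrative.

```python
import math
import random


def top_fraction(ranked_features, fraction):
    """Keep the highest-ranked features; rounding up is an assumption."""
    k = math.ceil(fraction * len(ranked_features))
    return ranked_features[:k]


def train_test_split(n_instances, train_fraction=0.8, seed=0):
    """Shuffle instance indices and cut them into training and test parts."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    cut = int(train_fraction * n_instances)
    return indices[:cut], indices[cut:]


ranked = list(range(57))  # placeholder: feature indices, best ranked first
for fraction in (0.87, 0.77, 0.70):
    print(fraction, len(top_fraction(ranked, fraction)))

train_idx, test_idx = train_test_split(4601)
print(len(train_idx), len(test_idx))
```

Under these assumptions the three subsets of the 57 spambase features contain 50, 44 and 40 features, and the 4601 instances split into 3680 training and 921 test instances.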
More than 99% prediction accuracy is achieved by Random Forest with all seven feature selection methods in consideration, whereas 97% prediction accuracy is achieved by PART with almost all of the seven feature selection methods while training the model. Training and testing results when 100% of the features are considered are given in Table 2.
Table 2 : Results of 100% feature selection
Classifier | Training Acc (%) | Testing Acc (%) | Time (ms)
Random Forest | 99.918 | 94.354 | 1540
PART | 96.416 | 92.291 | 4938
Table 3 : Training Results
FS (%) | FS | RF Acc (%) | RF Time (ms) | PART Acc (%) | PART Time (ms)
87% | Chi | 99.891 | 1349 | 98.234 | 3797
87% | Infogain | 99.837 | 1330 | 98.505 | 3080
87% | Gainratio | 99.918 | 1386 | 98.315 | 3611
87% | Relief | 99.891 | 1397 | 96.63 | 3467
87% | SU | 99.918 | 1367 | 98.505 | 3124
87% | OneR | 99.918 | 1470 | 96.902 | 4727
87% | Corr | 99.728 | 1153 | 95.027 | 847
77% | Chi | 99.918 | 1373 | 97.283 | 2701
77% | Infogain | 99.891 | 1498 | 97.147 | 3131
77% | Gainratio | 99.918 | 1604 | 97.006 | 4007
77% | Relief | 99.864 | 1367 | 97.799 | 3829
77% | SU | 99.891 | 1294 | 97.147 | 2867
77% | OneR | 99.891 | 1406 | 94.973 | 3469
77% | Corr | 99.728 | 1145 | 95.027 | 835
70% | Chi | 99.891 | 1282 | 97.092 | 2437
70% | Infogain | 99.918 | 1314 | 97.092 | 2409
70% | Gainratio | 99.864 | 1383 | 96.821 | 2642
70% | Relief | 99.81 | 1428 | 96.658 | 2855
70% | SU | 99.918 | 1276 | 97.092 | 2394
70% | OneR | 99.918 | 1442 | 95.245 | 2528
70% | Corr | 99.728 | 1152 | 95.027 | 845
Table 4 : Testing Results
FS (%) | FS | RF Acc (%) | PART Acc (%)
87% | Chi | 94.788 | 92.291
87% | Infogain | 94.137 | 93.16
87% | Gainratio | 93.594 | 94.137
87% | Relief | 95.114 | 93.185
87% | SU | 93.16 | 93.16
87% | OneR | 92.834 | 89.902
87% | Corr | 93.051 | 92.942
77% | Chi | 93.485 | 92.508
77% | Infogain | 94.028 | 93.051
77% | Gainratio | 94.245 | 92.291
77% | Relief | 93.485 | 92.617
77% | SU | 94.028 | 93.051
77% | OneR | 93.16 | 91.531
77% | Corr | 93.051 | 92.942
70% | Chi | 94.245 | 93.268
70% | Infogain | 94.68 | 93.268
70% | Gainratio | 94.028 | 94.463
70% | Relief | 93.811 | 91.965
70% | SU | 94.137 | 93.268
70% | OneR | 93.16 | 89.794
70% | Corr | 93.051 | 92.942
From the results above, it can be observed that for Random Forest, after using 87% of the extracted feature set, the training accuracy is 96.012%, whereas the computation time is reduced by 51.574% (from 9466 ms to 4584 ms). This shows that the remaining 13% of the features were not contributing to the classification.
Also, it can be observed that for PART, after using 87% and 77% of the extracted feature set, the training accuracy increased. With 87% feature selection there is a significant improvement of 1%, and the computation time is reduced by 67.879% (from 18558 ms to 5961 ms). This shows that the remaining 30% of the features were not only redundant but were also misleading the classification.
VI. Enron Results
More than 96% prediction accuracy is achieved
by Random Forest with all seven feature selection
methods under consideration, whereas more than 95%
prediction accuracy is achieved by PART with almost all
of the seven feature selection methods while training the
model. Training and testing results, when 100% of the features
are considered, are given in Table 5.
Table 5 : Results of 100% feature selection
Classifier       Training Acc (%)   Testing Acc (%)   Time (ms)
Random Forest    96.181             93.623            9466
PART             95.093             91.787            18558
Both training and testing results after feature
ranking and subset selection are shown in Table 6
and Table 7.
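The ranking and subset selection step can be illustrated with a minimal sketch. It assumes information gain as the ranking criterion (one of the seven methods considered) and discrete feature values; the feature names, the toy data and the helper `select_top_fraction` are hypothetical and are not taken from the experiments reported here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on a discrete feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [lab for v, lab in zip(feature_values, labels) if v == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def select_top_fraction(columns, labels, fraction):
    """Rank features by information gain and keep the top `fraction` of them."""
    ranked = sorted(columns,
                    key=lambda name: information_gain(columns[name], labels),
                    reverse=True)
    keep = max(1, round(fraction * len(ranked)))
    return ranked[:keep]


# Hypothetical toy data: three binary word-presence features, four emails.
labels = ["spam", "spam", "ham", "ham"]
columns = {
    "has_free":  [1, 1, 0, 0],   # separates the classes perfectly
    "has_hello": [1, 0, 1, 0],   # carries no information about the class
    "has_offer": [1, 1, 1, 0],   # partially informative
}

selected = select_top_fraction(columns, labels, 0.70)
print(selected)
```

With three toy features, a fraction of 0.70 keeps the two highest-ranked ones. The same ranking scheme would apply with any of the other scoring functions substituted for `information_gain`.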
Table 6 : Training Results
FS (%)   FS          RF Acc (%)   RF Time (ms)   PART Acc (%)   PART Time (ms)
87%      Chi         96.012       4210           94.634         5961
87%      Infogain    96.012       4106           94.634         5839
87%      Gainratio   96.012       4584           94.634         5791
87%      Relief      96.012       4070           94.634         5806
87%      SU          96.012       4170           94.634         5854
87%      OneR        96.012       4085           94.634         5856
87%      Corr        96.012       4147           94.634         5821
Both training and testing results on the
spambase dataset after feature ranking and subset
selection are shown in Table 3 and Table 4.
Table 7 : Testing Results
FS (%)   FS          RF Acc (%)   PART Acc (%)
87%      Chi         93.43        90.725
87%      Infogain    93.43        90.725
87%      Gainratio   93.43        90.725
87%      Relief      93.43        90.725
87%      SU          93.43        90.725
87%      OneR        93.43        90.725
87%      Corr        93.43        90.725
From the results above, it can be observed that
for Random Forest, with 87% of the extracted feature
set, the training accuracy is 96.012%, whereas
the computation time is reduced by 51.574% (from
9466 ms to 4584 ms). This suggests that the remaining
13% of the features were not contributing towards the
classification. Also, it can be observed that for PART, after
using 87% and 77% of the extracted feature set, the training
accuracy is increased. There is a significant
improvement of 1% at 87% feature selection, and
computation time is reduced by 67.879% (from 18558
ms to 5961 ms). This suggests that the
remaining 30% of the features were not only redundant but
were also misleading the classification.
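The time reductions quoted above follow from the relative difference between the timing with all features and the timing with the reduced feature set. A short sketch of that arithmetic, using the Enron timings from the text (the function name `percent_reduction` is only illustrative):

```python
def percent_reduction(before_ms, after_ms):
    """Relative reduction in time, as a percentage of the baseline value."""
    return (before_ms - after_ms) / before_ms * 100.0


# Enron timings quoted in the text: all features vs. 87% of the features.
rf_reduction = percent_reduction(9466, 4584)
part_reduction = percent_reduction(18558, 5961)

print(f"Random Forest: {rf_reduction:.3f}%")
print(f"PART: {part_reduction:.3f}%")
```

Rounded to three decimals, the two values agree with the 51.574% and 67.879% stated in the paragraph.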
G-mail Dataset Test Results
Further, we tested our Enron model on a
dataset created from emails received in our
Gmail accounts over the last 3 months. The
results are shown in Table 8. In this experiment, the
test dataset is completely non-overlapping with the
training set, allowing us to truly evaluate the performance
of our system.
Table 8 : Personal Email Dataset Testing Results
Classifier       Testing Accuracy (%)
Random Forest    96
PART             97.33
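A held-out evaluation of this kind can be sketched as follows. This is a minimal illustration only: the message identifiers, labels and predictions are hypothetical toy values, not the Gmail data, and the helpers `overlap` and `accuracy_percent` are assumed names.

```python
def overlap(train_ids, test_ids):
    """Return the identifiers that appear in both the training and test sets."""
    return set(train_ids) & set(test_ids)


def accuracy_percent(predicted, actual):
    """Percentage of test instances whose predicted label matches the true one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)


# Hypothetical toy data: the model is trained on one corpus, tested on another.
train_ids = ["enron-001", "enron-002", "enron-003"]
test_ids = ["gmail-001", "gmail-002", "gmail-003", "gmail-004"]
actual = ["spam", "ham", "spam", "ham"]
predicted = ["spam", "ham", "ham", "ham"]   # output of an already trained model

shared = overlap(train_ids, test_ids)       # an empty set means no leakage
test_accuracy = accuracy_percent(predicted, actual)

print(len(shared), test_accuracy)
```

Checking that the two sets share no messages before scoring is what makes the reported figure a measure of generalisation rather than of memorised training examples.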
VI. Conclusion
In this paper we have studied previous
approaches to spam email detection using machine
learning methodologies. We have compared and
evaluated the approaches based on factors such as
the dataset used; the features extracted, ranked and selected;
the feature selection algorithms used; and the results
obtained in terms of accuracy (precision, recall and error
rate) and performance (time required).
Spam datasets can contain a large number of training
instances, and for such large datasets Random
Forest and PART tend to produce better results with lower
error rates and higher precision. So, we used these two
classifiers to classify spam emails. For the
spambase dataset, we achieved a best
accuracy of 99.918% with Random Forest, which is 9%
better than previous spambase approaches, and
96.416% with PART. For the Enron dataset, we achieved a
best accuracy of 96.181% with Random
Forest and 95.093% with PART. The Enron dataset is used by
[21] in an unsupervised spam learning and detection
scheme. The feature selection algorithms used also
contributed to better accuracy with lower time
complexity due to dimensionality reduction. For
Random Forest, after using 70% of the extracted feature set
for the spambase dataset, the training accuracy
remained the same (99.918%) whereas the computation
time was reduced by about 17% (from 1540 ms to 1276 ms);
for PART, the training accuracy increased by
1.521% and computation time was reduced by about 51% (from
4938 ms to 2409 ms).
References Références Referencias
1. “A Study of Spam E-mail classification using
Feature Selection package”, R.Parimala, Dr. R.
Nallaswamy, National Institute of Technology,
Global Journal of Computer Science and
Technology, Volume 11 Issue 7 Version 1.0 May
2011.
2. “Comparative Study on Email Spam Classifier using
Data Mining Techniques”, R. Kishore Kumar, G.
Poonkuzhali, P. Sudhakar, Member, IAENG,
Proceedings of the International Multiconference of
Engineers and Computer Scientists 2012 Vol I,
IMECS 2012, March 14-16, Hong Kong.
3. “Machine Learning Methods for Spam E-mail
Classification”, W.A. Awad and S.M. ELseuofi,
International Journal of Computer Applications
(0975 – 8887) Volume 16– No.1, February 2011.
4. “Email Spam Filtering using Supervised Machine
Learning Techniques”, V.Christina, S.Karpagavalli,
G.Suganya, (IJCSE) International Journal on
Computer Science and Engineering, Vol. 02, No. 09,
2010, 3126-3129.
5. “Email Classification Using Data Reduction
Method”, Rafiqul Islam and Yang Xiang, member
IEEE, School of Information Technology Deakin
University, Burwood 3125, Victoria, Australia.
6. “Spam Classification based on Supervised Learning
using Machine Learning Techniques”, Ms.D
Karthika Renuka, Dr.T.Hamsapriya, Mr.M.Raja
Chakkaravarthi, Ms. P. Lakshmi Surya, 978-161284-764-1/11/$26.00 ©2011 IEEE.
7. “An Empirical Performance Comparison of Machine
Learning Methods for Spam E-mail Categorization”,
Chih-Chin Lai, Ming-Chi Tsai, Proceedings of the
Fourth International Conference on Hybrid Intelligent
Systems (HIS’04) 0-7695-2291-2/04 $ 20.00 IEEE.
8. “Introductory Statistics: Concepts, Models, and
Applications”, David W. Stockburger.
9. “Feature Subset Selection: A Correlation Based
Filter Approach”, Hall, M. A., Smith, L. A., 1997,
International Conference on Neural Information
Processing and Intelligent Information Systems,
Springer, p855-858.
10. “A practical approach to feature selection”, K. Kira
and L. A. Rendell, Proceedings of the Ninth
International Conference, 1992.
11. “Very simple classification rules perform well on
most commonly used datasets”, Holte, R.C. (1993)
Machine Learning, Vol. 11, 63–91.
12. “Induction of decision trees”, J.R. Quinlan, Machine
Learning 1, 81-106, 1986.
13. “UCI repository of Machine learning Databases”,
Department of Information and Computer Science,
University of California, Irvine, CA,
http://www.ics.uci.edu/~mlearn/MLRepository.html,
Hettich, S., Blake, C. L., and Merz, C. J., 1998.
14. “Random Forests”, Leo Breiman, Statistics
Department University of California Berkeley, CA
94720, January 2001.
15. “Exploiting Partial Decision Trees for Feature Subset
Selection in eMail Categorization”, Helmut Berger,
Dieter Merkl, Michael Dittenbach, SAC’06 April
23-27, 2006, Dijon, France Copyright 2006 ACM
1595931082/06/0004.
16. “C4.5: Programs for Machine Learning”, J. R.
Quinlan, Morgan Kaufmann Publishers Inc., 1993.
17. “Fast effective rule induction”, W. W. Cohen, In
Proc. of the Int’l Conf. on Machine Learning, pages
115–123. Morgan Kaufmann, 1995.
18. “Toward optimal feature selection using Ranking
methods and classification Algorithms”, Jasmina
Novaković, Perica Strbac, Dusan Bulatović, March
2011.
19. “SpamAssassin”, http://spamassassin.apache.org.
20. The enron spam dataset http://www.aueb.gr/users/ion/data/enron-spa
21. “A Case for Unsupervised-Learning-based Spam
Filtering”, Feng Qian, Abhinav Pathak, Y. Charlie
Hu, Z. Morley Mao, Yinglian Xie.