A Score Point based Email Spam Filtering Genetic Algorithm

ISSN:2229-6093
Preeti Trivedi et al, Int.J.Computer Technology & Applications,Vol 6 (6),955-960
Preeti Trivedi
Kanpur Institute of Technology, Kanpur, India
Email: [email protected]
Mr. Sudhir Singh
Kanpur Institute of Technology, Kanpur, India
Email: [email protected]
Abstract
E-mail is one of the most essential means of communication
over the Internet today. However, each day we spend several
minutes deleting spam that advertises products, offers loans
at low interest rates, promotes drugs, and so on. Although
spam filters are capable of identifying spam mails, spammers
are constantly evolving newer methods to send spam messages
to more and more people. With the advance of technology,
mobile and other portable electronic devices are now Wi-Fi
enabled, and Internet telephony, VoIP (Voice over Internet
Protocol), has made communicating across the world easier
and less expensive. Social networks such as Twitter,
Facebook, MySpace and Orkut are common means of connecting
with friends across the world. However, this has opened a
new audience for spammers to exploit. Spam is no longer
limited to e-mail: it appears on VoIP in the form of
unsolicited marketing or advertising phone calls, and on
social networks as marketing, advertising and pornography
links. Spam is everywhere. This paper presents a genetic
algorithm based spam filtering technique whose fitness
function is based on the score point. We show that the
considered algorithm provides a recognition rate of 84% at
an FPR of 0.001.
1. Introduction
The Internet today is fast, cheap and efficient. These three
features have made it a reliable medium of communication for
personal, business and academic purposes. E-mail is
therefore a dynamic tool for both business and personal
correspondence [1]. Owing to its ease and speed of delivery,
e-mail has considerably reduced the use of greeting cards
and letters.
Nowadays we receive not only solicited mails, called HAM
mails, but also unwanted mails, called SPAM mails.
Generally, spam is “unsolicited commercial e-mail; it
consists of mails that e-mail users do not want and that are
directed only to the e-mail
IJCTA | Nov-Dec 2015
Available [email protected]
owner or end users. Thus, spam is seen as unwanted e-mails
sent out in bulk and is mainly used by companies or
unethical marketers to advertise their products. Spam
messages are therefore used for marketing goods or offering
useless services. Some spam e-mails take the form of jokes
or scams sent out in bulk. Today, most spam mails advertise
cheap software, jewellery, medical supplies, money-making
schemes and pornography”, to mention but a few. Many of
these spam mails carry a single attractive image that
accompanies the message and may tempt the user to click on
it. Some messages contain links that lead directly to a
server where innocent users get infected with malware [2].
Despite the beneficial features of the Internet, it is also
gradually becoming a vehicle for malicious activities. Over
the years, worms, viruses and security issues have been seen
as the major problems facing information technology.
However, spam is growing daily and is becoming as
problematic as worms and viruses. Distributing bulk messages
costs unknown senders little or nothing [4-6]. Hence, e-mail
users still spend a lot of time and effort separating
legitimate mails from spam and deleting the spam from their
mailboxes. It is irritating when users check their mailbox
and find thousands of mails from unknown senders with
reasonably good subject lines but irrelevant content that
does not concern them. Such mail serves only to promote the
spammers' unethical markets and may be called spam.
Spam has now emerged as one of the most irritating problems
of the Internet. Every e-mail user receives numerous spam
mails daily, and there is still no proper solution to
protect an active e-mail address from becoming a spam
target. In spite of all defenses, there is a risk that the
wrong people may obtain the address through a link on a
website, a subscription to a newsletter, participation in a
contest, or a virus or worm on the system of a friend. There
are many possible ways
in which an address can reach a bulk mailer, and once that
happens there is no way to stop advertisers from sending a
flood of “great deals”. Stopping spam is also very
troublesome for home users. Companies suffer from spam in
numerous ways. Employees waste time scanning their mailboxes
to extract the good mails. In this process there is a chance
of missing an important mail, which increases with the
number of spam mails present. The volume of spam also
increases the workload of the mail servers. This can lead to
system slowdowns and thus reduce the effectiveness of the
company’s workflow.
Many researchers all over the world are involved in
extensive research to fight spam, yet a solution with very
good efficiency is still not available. As spam filtering is
a complex problem, it is not possible to stop spam e-mails
with one single solution. Because the structure of spam
e-mails changes continuously, we need a solution that is
adaptive in nature. In this paper, a genetic algorithm based
spam filtering method is proposed in which the fitness
function is designed using the experimental results of the
genetic algorithm.
2. Challenges in Spam E-mail Filtering
The growth of social media websites such as Facebook and
MySpace has helped spammers use tricks to retrieve
information such as the e-mail addresses and passwords of
users [5]. According to a 2008 market survey of social
network users, 83% of users had received at least one
unwanted message (spam) or friend request [6-7].
Installing spyware on a user's computer without the owner's
knowledge is one method spammers use to carry out their
operations. The spyware searches through the computer until
it locates the user's address book, which is then used to
promote their unethical market. Much remains to be done to
improve computer security, because spammers make use of
innocent users' computers or impersonate legitimate
businesses in order to send large volumes of spam. Apart
from that, spammers also use such tools to steal bank
details, social security numbers and credit card information
for the purposes of promoting their market and identity
theft.
It is therefore very difficult for ISPs to locate the source
of the spam when spammers use such compromised machines for
their operations. Computer users are advised to protect
their machines with a firewall and with anti-virus and
anti-spyware scans, as recommended by various spam-fighting
companies.
Spam filtering is used not only for technical purposes, such
as reducing the load on network bandwidth and e-mail
storage, but also to address social issues such as child
safety and phishing e-mail. Many of these methods have done
a lot of good for ISPs by helping
them to save a great deal of money and by filtering spam
mails into the junk folder while allowing legitimate mails
to be delivered successfully. However, they still cannot be
fully relied on because they are not sufficiently effective.
Most spam filters are liable to false positives, in which
legitimate mails are classified as spam, and false
negatives, in which junk mails arrive in the user's inbox.
It is therefore essential to consider the occurrence of
false positives and false negatives when evaluating a
filter.
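A filter has four possible outcomes: spam classified as
spam, ham classified as ham, ham classified as spam (false
positive) and spam classified as ham (false negative). The
first two are the outcomes we want; the last two are the
outcomes we do not want. To measure the latter we define the
false positive ratio (FPR) and the false negative ratio
(FNR) as follows: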
\[
\mathrm{FPR} = \frac{\text{Emails wrongly identified as spam}}{\text{Total emails}},
\qquad
\mathrm{FNR} = \frac{\text{Emails wrongly identified as ham}}{\text{Total emails}}.
\]
3. Genetic Algorithms
The genetic algorithm (GA) technique has many advantages
over traditional non-linear solution techniques. Neither
kind of technique always achieves an optimal solution;
however, a GA reaches a near-optimal solution more easily
than other methods.
Genetic Algorithm Steps
The details of how genetic algorithms work are explained
below. The general layout of the genetic operations is shown
in Fig. 1.
The basic idea of the GA is to generate offspring that are
better than the previous generation. As detailed in Fig. 1,
we generate a solution by selecting parents; if the solution
is good enough we stop, otherwise we select new parents to
reproduce, and crossover is performed again to generate new
offspring. This process continues until offspring with the
desired results are generated.
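The loop described above can be sketched in Python as follows. This is only an illustrative skeleton and not necessarily the implementation used in the experiments: the function name evolve, the stopping parameter target, the generation limit and the toy bit-counting problem are all assumptions introduced for the example, and the selection and crossover operators are passed in as functions.

```python
import random
from typing import Callable, List, Tuple

Chromosome = List[int]


def evolve(
    population: List[Chromosome],
    fitness: Callable[[Chromosome], float],
    select: Callable[[List[Chromosome]], Chromosome],
    crossover: Callable[[Chromosome, Chromosome], Tuple[Chromosome, Chromosome]],
    target: float,
    max_generations: int = 100,  # hypothetical safety limit
) -> Chromosome:
    """Generational loop: stop as soon as a chromosome reaches `target`."""
    for _ in range(max_generations):
        best = max(population, key=fitness)
        if fitness(best) >= target:
            return best  # solution is good enough
        offspring: List[Chromosome] = []
        while len(offspring) < len(population):
            # Select two parents and recombine them into two children.
            child_a, child_b = crossover(select(population), select(population))
            offspring.extend([child_a, child_b])
        population = offspring[: len(population)]
    return max(population, key=fitness)


# Toy usage (assumed problem): maximise the number of 1-bits in 8 bits.
rng = random.Random(0)


def pick(pop: List[Chromosome]) -> Chromosome:
    # Fitness-weighted choice; the +1 avoids an all-zero weight vector.
    return rng.choices(pop, weights=[sum(c) + 1 for c in pop], k=1)[0]


def swap_tails(a: Chromosome, b: Chromosome) -> Tuple[Chromosome, Chromosome]:
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


start = [[rng.randint(0, 1) for _ in range(8)] for _ in range(10)]
best = evolve(start, sum, pick, swap_tails, target=8)
```

Passing the operators as arguments keeps the loop independent of any particular selection or crossover scheme, so the specific operators discussed in the following subsections can be substituted without changing the loop.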
A. Initialization
In a genetic algorithm based solution, the initial
population is in general generated randomly. However, in
applications where the range of the solution is known, the
initial population need not be random and can be drawn from
that range. This approach gives the GA a good starting point
and speeds up the evolutionary process.
B. Reproduction
In reproduction, the complete population is replaced in each
generation following the survival-of-the-fittest principle.
In this approach, two mates from the old generation are
coupled to produce two offspring. For a population of size
N, this procedure is repeated N/2 times, thus producing N
new chromosomes.
Genetic Algorithm Flow Chart
Mutation is useful because crossover may not be able to
produce new alleles if they do not appear in the initial
generation; with mutation, a new type of chromosome can be
generated that combines old and new characters.
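A minimal sketch of bit-flip mutation is given below to illustrate this point. It assumes that each bit is flipped independently with a small probability; the function name mutate and the per-bit interpretation of the mutation rate are assumptions of this example rather than details stated in the paper.

```python
import random
from typing import List


def mutate(chromosome: List[int], rate: float, rng: random.Random) -> List[int]:
    """Return a copy in which each bit is flipped with probability `rate`."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in chromosome]


# Crossover alone can never create a 1 from an all-zero population,
# whereas mutation can introduce the missing allele.
rng = random.Random(42)
all_zero = [0] * 10
mutated = mutate(all_zero, rate=0.03, rng=rng)
```

With a rate of 0 the chromosome is returned unchanged, and with a rate of 1 every bit is inverted, which makes the two extremes easy to check.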
4. Genetic Algorithm in E-mail Filtering Process
A genetic algorithm can be used as a spam classifier. The
collection of e-mails is called a corpus [2-4]. Spam mails
from the corpus are encoded into a class of chromosomes, and
these chromosomes undergo the genetic operations, i.e.,
crossover, mutation and evaluation by the fitness function.
The rule set for spam mails is developed using the genetic
algorithm.
Rules for classifying the emails:
Fig. 1 Genetic Operation and Evolution Process
C. Parent Selection Mechanism
In general, parents are selected by a probabilistic method
in which the chance of each parent being selected is related
to its fitness.
Fitness-based selection
In this work, parent selection is done using Roulette Wheel
selection, a fitness-based selection method. In this method,
the probability that a chromosome is selected is directly
proportional to its fitness. The effect of this depends on
the range of fitness values in the current population.
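Roulette Wheel selection can be sketched as follows. The function name roulette_select and the uniform fallback for an all-zero fitness vector are assumptions made for this example; non-negative fitness values are assumed.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def roulette_select(
    population: Sequence[T], fitnesses: Sequence[float], rng: random.Random
) -> T:
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    if total <= 0:
        # Assumed fallback: no fitness information, so choose uniformly.
        return rng.choice(list(population))
    spin = rng.random() * total  # position of the pointer on the wheel
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if spin < running:
            return individual
    return population[-1]  # guards against floating-point round-off


rng = random.Random(1)
parents: List[str] = [
    roulette_select(["low", "high"], [1.0, 3.0], rng) for _ in range(10000)
]
share_high = parents.count("high") / len(parents)
```

In this usage example the chromosome with three times the fitness should be drawn roughly three times as often, which also illustrates why the selection pressure depends on the spread of fitness values in the population.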
D. Crossover Operator
Crossover is the most important operation in a GA, as it
produces the offspring. As the name suggests, crossover is a
process of recombining bit strings by exchanging segments
between pairs of chromosomes. In this paper, one-point
crossover is considered.
One-point Crossover
In one-point crossover, a bit position (the crossover point)
is selected at random. The bits before the crossover point
are kept unchanged, and the bits after it are swapped
between the two parents.
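One-point crossover on bit strings can be sketched as below. The function name one_point_crossover and the optional point argument, which allows the crossover position to be fixed for demonstration, are assumptions of this example.

```python
import random
from typing import Optional, Tuple


def one_point_crossover(
    parent_a: str,
    parent_b: str,
    point: Optional[int] = None,
    rng: Optional[random.Random] = None,
) -> Tuple[str, str]:
    """Keep the bits before `point` and swap the tails of the two parents."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    if point is None:
        # Choose a random interior position so both parents contribute.
        point = (rng or random).randrange(1, len(parent_a))
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


child_a, child_b = one_point_crossover("1111111111", "0000000000", point=4)
```

With the crossover point fixed at 4, the parents 1111111111 and 0000000000 yield the offspring 1111000000 and 0000111111.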
The weights of the genes in the test mail and the weights of
the genes in the spam mail prototypes are compared to find
the matched genes. If the number of matched genes is greater
than some number, say ‘x’, then the mail is considered spam.
Fitness Function:
\[
F = \begin{cases} 1, & \text{SPAM mail} \\ 0, & \text{HAM mail} \end{cases}
\]
The basic idea is to separate SPAM and HAM mails from the
mails arriving in the mailbox. The fitness function is
itself problem dependent and cannot be fixed in advance for
SPAM e-mail filtering. To derive the fitness function we
carried out experiments, and we found that the minimum score
point over the available 1346 SPAM mails was 1. Hence, we
defined our fitness function as
\[
F = \begin{cases} 1, & \text{Score point} \geq 1 \\ 0, & \text{Score point} < 1 \end{cases}
\]
In the earlier work [5], a minimum score point of 3 was
considered for the classification of SPAM and HAM mails. In
spam classification, GA based methods only look for the
words available in the data dictionary.
Fig. 2 Schematic of one point crossover
E. Mutation
Mutation helps ensure that no possible chromosome becomes
unreachable, so that good genes can be maintained in the
newly generated chromosomes. The mutation operator achieves
this by randomly selecting a bit position in a string and
flipping it if required.
In general, looking for a single word in arriving e-mails
and classifying them as SPAM or HAM on the basis of that
word alone will produce many errors. Hence, it is necessary
to consider the words, their frequency and the total number
of words in an e-mail for more accurate classification of
mails. In the GA, words can be added to or deleted from the
data dictionary, so the method is adaptive in nature.
In our experiment we considered a database of 2448 e-mails,
of which 1346 are SPAM mails and the remaining 1102 are HAM
mails [7]. The data dictionary contains 421 words, which are
divided into seven categories, C1-C7. The data dictionary
can be found in Appendix A.
After normalization, the weights lie in the range 0.000 to
1.000. Each weight is stored in 10 bits, so the weight of a
gene can be encoded as follows:
Binary 0000000000 represents weight 0.000
Binary 0000000001 represents weight 0.001
Binary 0000000010 represents weight 0.002
………………………………………………
……………………………………………….
Binary 1111100111 represents weight 0.999
Binary 1111101000 represents weight 1.000
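The encoding listed above amounts to writing the weight, scaled by 1000, as a 10-bit binary number. A minimal sketch is given below; the function names encode_weight and decode_weight are hypothetical and are introduced only for this example.

```python
def encode_weight(weight: float) -> str:
    """Encode a normalized weight in [0, 1] as a 10-bit binary string."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in the range 0.000 to 1.000")
    # Three decimal places are kept, so the weight maps to an integer 0..1000.
    return format(round(weight * 1000), "010b")


def decode_weight(bits: str) -> float:
    """Recover the weight (to three decimal places) from its 10-bit code."""
    return int(bits, 2) / 1000


low_code = encode_weight(0.001)
high_code = encode_weight(0.999)
```

Here the weight 0.001 is encoded as 0000000001 and the weight 0.999 as 1111100111, in agreement with the listing.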
The procedure for calculating the weight of a word and of
its group is detailed below.
As an example, suppose an e-mail contains four dictionary
words, namely hotel, luxury, tax and transaction. Of these
four words, tax and transaction belong to category C2, while
hotel and luxury belong to category C5 (see Appendix A). Let
us consider an e-mail of 567 words, of which 237 are
occurrences of hotel, luxury, tax and transaction, with
frequencies 84, 23, 97 and 33 respectively. Such large
frequencies are chosen to make sure that the considered mail
is a spam mail, since the spam database is very small and
contains only 421 words. The words extracted from the e-mail
are first checked to see whether they belong to any category
of the spam database. If a word in the e-mail matches a word
in the spam data dictionary, its weight, i.e. the
probability of drawing that word from the e-mail, is
computed using the simple formula
\[
W_w = \frac{SWM}{TWM},
\]
where
SWM : Total occurrences of the spam word in the e-mail
TWM : Total words in the e-mail
The weight W_w for the word ‘hotel’ is
\[
W_w = \frac{84}{567} \approx 0.1482
\]
Similarly, the weight of the word ‘luxury’ is 0.04056.
The weight of a category is calculated by averaging the
weights of its words; for example, the weight of category C5
is (0.1482 + 0.04056)/2 = 0.0944.
The weights obtained for each word are tabulated in Table 1.
Table 1: Calculation of weights under the average weightage method

Group   Word          Frequency   Weight of word   Weight of group
C2      tax           97          0.1711           0.1147
C2      transaction   33          0.0582
C5      hotel         84          0.1482           0.0944
C5      luxury        23          0.04056
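The weight calculation of Table 1 can be reproduced with the short sketch below. The function names word_weights and group_weights and the dictionary layout are assumptions made for this example; the frequencies, total word count and category assignments are those of the worked example.

```python
from collections import defaultdict
from typing import Dict, List


def word_weights(frequencies: Dict[str, int], total_words: int) -> Dict[str, float]:
    """Weight of each word: its frequency divided by the words in the e-mail."""
    return {word: count / total_words for word, count in frequencies.items()}


def group_weights(
    weights: Dict[str, float], group_of: Dict[str, str]
) -> Dict[str, float]:
    """Weight of each category: the average weight of its matched words."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for word, weight in weights.items():
        grouped[group_of[word]].append(weight)
    return {group: sum(vals) / len(vals) for group, vals in grouped.items()}


frequencies = {"tax": 97, "transaction": 33, "hotel": 84, "luxury": 23}
group_of = {"tax": "C2", "transaction": "C2", "hotel": "C5", "luxury": "C5"}

weights = word_weights(frequencies, total_words=567)
groups = group_weights(weights, group_of)
```

The computed values agree with Table 1 up to rounding in the last printed digit, since the table averages word weights that have already been rounded.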
Figure 2 SPAM chromosomes prototype
As shown in Figure 2, each mail is encoded into a chromosome
consisting of 70 bits, 10 bits for each category. The weight
value is represented in hex number format.
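A 70-bit chromosome of this kind can be assembled as sketched below. The gene order C1 to C7, the treatment of unmatched categories as weight 0 and the function names build_chromosome and split_genes are assumptions of this example; each gene is written as a 10-bit binary string rather than in hex for readability.

```python
from typing import Dict, List

CATEGORIES = ("C1", "C2", "C3", "C4", "C5", "C6", "C7")  # assumed gene order
BITS_PER_GENE = 10


def build_chromosome(category_weight: Dict[str, float]) -> str:
    """Concatenate one 10-bit gene per category into a 70-bit string."""
    genes = []
    for category in CATEGORIES:
        weight = category_weight.get(category, 0.0)  # unmatched category: 0
        genes.append(format(round(weight * 1000), "010b"))
    return "".join(genes)


def split_genes(chromosome: str) -> List[str]:
    """Cut a chromosome back into its seven 10-bit genes."""
    return [
        chromosome[i : i + BITS_PER_GENE]
        for i in range(0, len(chromosome), BITS_PER_GENE)
    ]


# Category weights taken from the worked example (C2 and C5 only).
chromosome = build_chromosome({"C2": 0.1147, "C5": 0.0944})
genes = split_genes(chromosome)
```

For the worked example only the second and fifth genes are non-zero, because the mail contains dictionary words from categories C2 and C5 only.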
Once chromosomes have been constructed for all the mails,
the genetic algorithm starts and crossover takes place. In
each generation only 12% of the chromosomes are crossed. The
mutation rate is only 3%.
The weights of the genes in the test mail and the weights of
the genes in a spam mail prototype are compared to find the
matched genes. If the number of matched genes is greater
than or equal to one, then the spam mail prototype receives
one score point. If the score points reach a threshold
number of score points, then the mail is considered a spam
mail. The threshold can be adjusted manually to obtain
appropriate results. It must be remembered that the fitness
function we use is based on our experimental results.
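The score point rule can be sketched as follows. The paper does not state exactly when two genes are considered matched, so this example assumes that a gene matches when both weights are non-zero and differ by at most a tolerance; the tolerance value, the function names and the sample prototypes are all hypothetical.

```python
from typing import Sequence


def matched_genes(
    mail: Sequence[float], prototype: Sequence[float], tolerance: float = 0.05
) -> int:
    """Count genes whose weights are both non-zero and close (assumed rule)."""
    return sum(
        1
        for m, p in zip(mail, prototype)
        if m > 0 and p > 0 and abs(m - p) <= tolerance
    )


def score_points(
    mail: Sequence[float],
    prototypes: Sequence[Sequence[float]],
    tolerance: float = 0.05,
) -> int:
    """A prototype earns one score point if at least one gene matches."""
    return sum(1 for proto in prototypes if matched_genes(mail, proto, tolerance) >= 1)


def is_spam(
    mail: Sequence[float],
    prototypes: Sequence[Sequence[float]],
    threshold: int = 1,
    tolerance: float = 0.05,
) -> bool:
    """Fitness rule: F = 1 (spam) when the score point reaches the threshold."""
    return score_points(mail, prototypes, tolerance) >= threshold


# Hypothetical prototypes and test mails, one weight per category C1..C7.
prototypes = [
    [0.0, 0.115, 0.0, 0.0, 0.094, 0.0, 0.0],
    [0.3, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2],
]
spam_like = [0.0, 0.12, 0.0, 0.0, 0.09, 0.0, 0.0]
ham_like = [0.0, 0.0, 0.01, 0.0, 0.0, 0.0, 0.0]
```

With the default threshold of 1, spam_like is classified as spam because it matches the first prototype, while ham_like earns no score point. Raising the threshold makes the filter stricter, which corresponds to the manual adjustment mentioned above.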
5. Results
Figure 3 Recognition Rate Vs. FPR.
In Figure 3, the recognition rate is plotted against the
false positive ratio (FPR). At an FPR of 0.001 the
recognition rate is nearly 84%, which increases to 97.5% at
an FPR of 1.
In our early results we found that the larger the number of
words in a mail, the more accurately it can be classified.
We tested our algorithm on a large corpus of 2448 mails, of
which 1346 were SPAM mails and the rest were HAM mails.
Results on such a large e-mail corpus give a more accurate
picture of the classification of mails and of the
effectiveness of the GA. In earlier experiments only 140
mails were considered, and it was stated that the running
time of the GA is very large, so that a larger mail corpus
would take much more time. We therefore ran this experiment
on a high-end machine to get a clearer and more accurate
picture of the GA. In our experiments we found that nearly
84% of mails are correctly classified by our method at an
FPR of 0.001, which increases to 97.5% at an FPR of 1.
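For reference, the evaluation measures can be computed from a list of true labels and a list of predictions as sketched below. FPR and FNR follow the definitions of Section 2 (errors divided by the total number of e-mails); treating the recognition rate as the fraction of correctly classified mails is an assumption of this example, and the function name evaluate and the sample labels are hypothetical.

```python
from typing import Dict, Sequence


def evaluate(actual_spam: Sequence[bool], predicted_spam: Sequence[bool]) -> Dict[str, float]:
    """Compute FPR, FNR and recognition rate over a labelled corpus."""
    if len(actual_spam) != len(predicted_spam) or not actual_spam:
        raise ValueError("label lists must be non-empty and of equal length")
    total = len(actual_spam)
    pairs = list(zip(actual_spam, predicted_spam))
    # Ham wrongly identified as spam.
    false_pos = sum(1 for actual, pred in pairs if pred and not actual)
    # Spam wrongly identified as ham.
    false_neg = sum(1 for actual, pred in pairs if actual and not pred)
    return {
        "FPR": false_pos / total,
        "FNR": false_neg / total,
        # Assumed definition: share of mails classified correctly.
        "recognition_rate": (total - false_pos - false_neg) / total,
    }


# Hypothetical five-mail corpus: three spam mails followed by two ham mails.
actual = [True, True, True, False, False]
predicted = [True, True, False, True, False]
metrics = evaluate(actual, predicted)
```

In this small example one ham mail is flagged as spam and one spam mail is missed, so FPR and FNR are both 1/5 and the recognition rate is 3/5.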
6. Conclusion
This paper discusses a genetic algorithm based e-mail spam
classification method. The major advantage of the GA based
approach is that it is adaptive in nature. This technique
can be applied on the user side; the recognition rate is 84%
at an FPR of 0.001. As the recognition rate is not very
high, this technique can be used along with another
technique such as a neural network or a Bayesian classifier.
The results of the two techniques can be combined using
fuzzy logic to get a higher recognition rate.
7. References
[1] Enrico Blanzieri and Anton Bryl, “A Survey of Learning-Based
Techniques of Email Spam Filtering,” Conference on Email and
Anti-Spam, 2008.
[2] K.S. Tang et al., “Genetic Algorithms and Their
Applications,” IEEE Signal Processing Magazine, pp. 22-37,
Nov. 1996.
[3] Usarat Sanpakdee et al., “Adaptive Spam Mail Filtering Using
Genetic Algorithm,” ICACT 2006.
[4] Yang, J., & Honavar, V., “Feature subset selection using a
genetic algorithm,” Intelligent Systems and their Applications,
IEEE, Vol. 13, No. 2, pp. 44-49, 1998.
[5] Shrivastava, J. N., & Bindu, M. H., “E-mail Spam Filtering
Using Adaptive Genetic Algorithm,” International Journal of
Intelligent Systems & Applications, Vol. 6, No. 2, pp. 54-60,
2014.
[6] Karimpour, J., Noroozi, A. A., & Abadi, A., “The Impact of
Feature Selection on Web Spam detection,” International Journal
of Intelligent Systems and Applications (IJISA), Vol. 4, No. 9,
p. 61, 2012.
[7] Spam Assassin, http://spamassassin.org.
Appendix A
Group, content and example of keywords in each group

C1 (Adult):
adult, aphrodisiac, big, cam, climax, company, cum, desire,
erotic, fantasy, fuck, gay, girl, great, guy, hard, hardcore,
heaven, hot, huge, long, man, max, max length, nude, orgasm,
penis, performance, pheromone, pill, porn, powerful, pussy,
satisfy, sex, stamina, sweet, teen, Viagra, webcam, x, xxx,
xxx-porn, young, love, teen, anus

C2 (Financial):
account, accountant, alert, analyst, attorney, bank,
bankruptcy, benefit, bill, billing, broker, budget, building,
cash, cheque, commission, consolidate, court, credit, creditor,
currency, customer, debt, deposit, discover, economy,
entrepreneur, estate, exchange, fee, finance, freedom, fund,
help, high-risk, insurance, invest, investor, judgment, legal,
legitimate, lender, loan, mastercard, mortgage, obligate, pay,
payable, payable, paycheck, promote, purchase, rate, refinance,
refund, rent, revenue, risk, service, statement, stock,
support, tax, transaction, vat, visa, wealth, worth, service

C3 (Commercial):
college, commerce, computer, cost, deliver, discount, especial,
expensive, express, fantastic, free, furnishing, furniture,
game, get, gif, gift, great, guarantee, inexpensive, invite,
item, just, keyboard, license, lifetime, magazine, maintenance,
mall, market, material, materials, mobile, motherboard, mouse,
offer, online, only, order, palm, pamphlet, percent, premium,
price, produce, product, program, recommend, refill, release,
resell, reseller, retail, sale, save, save, sell, ship,
shipping, shop, shopping, special, subscribe, supply, surprise,
trade, trademark, upgrade, voucher, whole, wholesale, within

C4 (Beauty and diet):
after, age, amaze, anti-aging, appetite, beauty, become,
before, believe, blood, body, botanic, breast, build, burn,
calorie, capsule, card, cell, change, chemical, cholesterol,
confirm, course, diet, difference, dose, drug, effect,
effective, eliminate, energy, enhance, exercise, eye, face,
fast, fat, firm, fit, fitness, flexible, gary, grow, grown,
growth, hair, health, healthcare, heart, height, herb, herbal,
hormone, improve, inche, incredible, kidney, large, laser,
life-changing, light, lose, loss, low, magic, medicine,
metabolism, micro-cap, miracle, modem, move, muscle, nature,
nutrient, old, over, overweight, permanent, plain, potential,
pound, power, protect, reduce, remanufacture, repair, restore,
retain, reverse, safe, satisfaction, secret, size, step,
strength, strong, tablet, therapy, thin, toxin, treatment,
under, virginia, vitamin, weight, woman, wonderful, wrinkle

C5 (Traveling):
book, deluxe, excite, guide, holiday, honest, hotel, luxury,
meal, package, plan, problem, relax, relief, reserve, resort,
summer, temple, ticket, tour, train, travel, traveler, trip,
vacation

C6 (Home-Based Business):
address, astonishment, base, broadcast, bulk, business,
comfort, connect, demo, domain, downline, download, earn,
email, emailing, ethernet, facemail, fresh, home, homebased,
homeworker, host, income, interest, international, internet,
investigate, job, list, lucrative, mail, mailbox, mailer,
mailing, make, marketing, message, million, money-making,
opportunity, part-time, people, private, profit, reach,
receive, recipient, require, re-register, return, server,
software, subscriber, success, teach, unsubscribe, user, visit,
website, work, work-at-home, worker, working

C7 (Gambling):
action, award, bet, bonus, casino, challenge, extra, gambling,
gold, hunt, lass, lucky, millionaire, player, poker, prize,
reward, rich, vegas, win, lottery