ISSN:2229-6093
Preeti Trivedi et al, Int.J.Computer Technology & Applications, Vol 6 (6), 955-960

A Score Point based Email Spam Filtering Genetic Algorithm

Preeti Trivedi
Kanpur Institute of Technology, Kanpur, India
Email: [email protected]

Mr. Sudhir Singh
Kanpur Institute of Technology, Kanpur, India
Email: [email protected]

Abstract

E-mail is one of the most essential means of communication over the Internet today. However, each day we spend several minutes deleting spam that advertises products, offers loans at low interest rates, sells drugs, and so on. Although spam filters are capable of identifying spam mails, spammers are constantly evolving newer methods to send spam messages to more and more people. With the advance of technology, mobile and other portable electronic devices are now Wi-Fi enabled, and Internet telephony, VoIP (voice over Internet protocol), has made communicating across the world easier and inexpensive. Social networks such as Twitter, Facebook, MySpace and Orkut are very common means of connecting with friends across the world. However, this has opened a newer audience for spammers to exploit. Spam is no longer limited to e-mail: it appears on VoIP in the form of unsolicited marketing or advertising phone calls, and on social networks as marketing, advertising and pornography links. Spam is everywhere. This paper presents a genetic algorithm based spam filtering technique whose fitness function is based on the score point. We show that the considered algorithm provides a recognition rate of 84% at a false positive ratio (FPR) of 0.001.

1. Introduction

The Internet today is fast, cheap and efficient. These three features have made it a reliable means of communication for personal, business and academic purposes. E-mail is therefore a dynamic tool for business and personal correspondence [1].
However, the arrival of e-mail has reduced the use of greeting cards and letters because of the ease and speed of e-mail delivery. Nowadays we receive not only solicited mails, called HAM mails, but also unwanted mails, called SPAM mails. Generally, spam is unsolicited commercial e-mail: mail that e-mail users do not want and that is merely directed at the e-mail owner or end user. Spam is thus seen as unwanted e-mail sent out in bulk, mainly by companies or unethical marketers in order to advertise their products. Spam messages are therefore used for marketing goods or offering useless services. Some spam e-mails take the form of jokes or scams sent out in bulk. Today, most spam mails advertise cheap software, jewellery, medical supplies, money-making schemes and pornography, to mention but a few. Many of these spam mails carry a single attractive image that comes with the message and may tempt the user to click on it. Some messages contain links that lead directly to a server where innocent users get infected with malware [2].

IJCTA | Nov-Dec 2015 Available [email protected]

Despite the beneficial features of the Internet, it is also increasingly used to support malicious activities. Over the years, worms, viruses and security issues have been seen as the major problems facing information technology. However, spam is taking on a different dimension every day and is becoming as problematic as worms and viruses. Distributing bulk messages costs unknown senders little or nothing [4-6]. Hence, e-mail users still spend a lot of time and effort recognizing legitimate mails and deleting spam from their mailboxes. It is irritating when users check their mailbox and find thousands of mails from unknown senders with a reasonably good heading but irrelevant content that does not concern them.
Such mails exist only to let spammers promote their unethical markets, and they are what we call spam. Spam has now become one of the largest and most irritating problems of the Internet. Every e-mail user receives numerous spam mails daily, and there is still no proper solution to protect an active e-mail address from becoming a spam target. In spite of all defenses, there is a risk that the wrong people may obtain the address through a link on a website, subscription to a newsletter, participation in a contest, or a virus or worm on the system of a friend. There are many possible ways in which an address can reach a bulk mailer, and once it happens there is no way to stop those advertisers from spreading a flood of "great deals".

Stopping spam is also very troublesome for home users. Companies suffer from spam in numerous ways. Employees waste their time scanning their mailboxes to extract the good mails. In this procedure there is a probability of missing an important mail, which increases with the number of spam mails present. The large number of spam mails also increases the workload of the mail servers. This can lead to system slowdowns and thus to reduced effectiveness of the company's workflow.

Many researchers all over the world are involved in extensive research to fight spam, yet a solution with very good efficiency is still not available. As spam filtering is a complex problem, it is not possible to stop spam e-mails with one single solution. Because the structure of spam e-mails changes continuously, we need a solution that is adaptive in nature. In this paper, a genetic algorithm based spam filtering method is proposed, in which the fitness function is designed using the experimental results of the genetic algorithm.

2. Challenges in Spam E-mail Filtering

The growth of social media websites such as Facebook and MySpace has helped spammers use tricks to retrieve information such as the e-mail addresses and passwords of users [5]. According to a 2008 market survey of social network users, 83% of the users had received at least one unwanted message (spam) or friend request [6-7]. Spyware downloaded onto users' computers without the knowledge of the owner is one method that spammers use to carry out their operations. The spyware works by rooting through the computer until it locates the user's address book, which is then used to promote the spammers' unethical market. Much remains to be done to improve computer security, because spammers make use of innocent users' computers or impersonate legitimate businesses in order to send large numbers of spam mails. Apart from that, spammers also use such tools to steal bank details, social security numbers and credit card information in order to promote their market and commit identity theft. When spammers operate from such compromised machines, it becomes very difficult for ISPs to locate the source of the spam. Computer users are therefore advised to protect their machines with a firewall, anti-virus and anti-spyware scans, as recommended by different spam fighting companies.

Spam filtering addresses not only technical problems such as the overloading of network bandwidth and e-mail storage, but also social issues such as child safety and phishing e-mail. Many of these methods have done a good job for ISPs by helping them save a great deal of money, filtering spam mails into the junk folder and allowing legitimate mails to be delivered successfully, but they still cannot be relied on because they are not sufficiently effective.
Most spam filters are liable to false positives, when legitimate mails are classified as spam, and false negatives, when junk mails arrive in the user's inbox. It is therefore essential to consider the occurrence of false positives and false negatives when evaluating a filter. A filter's decision on a mail has four possible outcomes: spam identified as spam, ham identified as ham, ham wrongly identified as spam, and spam wrongly identified as ham. The first two items are the ones that we want to happen. The last two items are the outcomes we do not want. To measure these we define the false positive ratio (FPR) and the false negative ratio (FNR) as follows:

\[ \mathrm{FPR} = \frac{\text{Emails wrongly identified as spam}}{\text{Total emails}} \]

\[ \mathrm{FNR} = \frac{\text{Emails wrongly identified as ham}}{\text{Total emails}} \]

3. Genetic Algorithms

The Genetic Algorithm (GA) technique has many advantages over traditional non-linear solution techniques. Neither kind of technique always achieves an optimal solution, but a GA provides a near-optimal solution more easily than other methods.

Genetic Algorithm Steps

The details of how Genetic Algorithms work are explained below. The general layout of the genetic operations is shown in Fig. 1. The basic idea of the GA is to generate offspring that are better than the previous generation. As detailed in the figure, we generate a solution by selecting parents; if the solution is good enough we stop, otherwise we select new parents to reproduce, and crossover is performed again to generate new offspring. This process continues until offspring with the desired results are generated.

A. Initialization
In a genetic algorithm based solution, the initial population is in general generated randomly. However, in some applications where the range of the solution is known, the initial population is not generated randomly. With this approach, we are able to give the GA a good starting point and speed up the evolutionary process.

B. Reproduction
In reproduction, the complete population is replaced in each generation using the survival-of-the-fittest principle.
In this approach, two mates from the old generation are coupled together to produce two offspring. For an initial population of N, this procedure is repeated N/2 times, thus producing N newly generated chromosomes.

Fig. 1 Genetic Operation and Evolution Process (Genetic Algorithm Flow Chart)

C. Parent Selection Mechanism
In general, the chance of each parent being selected is related to its fitness, and a probabilistic method is used for parent selection.

Fitness-based selection
In this work, parent selection is done using Roulette Wheel selection, also called fitness-based selection. In this method, the probability of survival of each chromosome is directly proportional to its fitness. The effect of this depends on the range of fitness values in the current population.

D. Crossover Operator
Crossover is the most important operation in a GA, as it produces the offspring. As the name suggests, crossover is a process of recombining bit strings via an exchange of segments between pairs of chromosomes. In this paper, one-point crossover is considered.

One-point Crossover
In one-point crossover, a bit position is selected at random. The bits before this position are kept unchanged, and the bits after it are swapped between the two parents.

Fig. 2 Schematic of one point crossover

E. Mutation
Mutation has the effect of ensuring that all possible chromosomes remain reachable and that good genes can be maintained in the newly generated chromosomes. Crossover alone can only recombine alleles that already exist in the population; the mutation operator overcomes this by randomly selecting a bit position in a string and flipping it if required. This is useful since crossover may not be able to produce new alleles if they do not appear in the initial generation, whereas with mutation a new type of chromosome can be generated with old and new characters.

4. Genetic Algorithm in E-mail Filtering Process

A genetic algorithm can be used as a spam classifier. The collection of e-mails is called the corpus [2-4]. Spam mails from the corpus are encoded into a class of chromosomes, and these chromosomes undergo the genetic operations, i.e., crossover, mutation and evaluation by the fitness function. The rule set for spam mails is developed using the genetic algorithm.

Rules for classifying the emails:
The weights of the words of each gene in the testing mail and the weights of the words of each gene in the spam mail prototypes are compared, and the matched genes are found. If the number of matched genes is greater than some number, say 'x', then the mail is considered spam.

Fitness Function:

\[ F = \begin{cases} 1 & \text{SPAM mail} \\ 0 & \text{HAM mail} \end{cases} \]

The basic idea is to separate SPAM and HAM mails among the mails arriving in the mailbox. The fitness function is itself problem dependent and cannot be fixed in advance for SPAM e-mail filtering. To derive the fitness function we carried out experiments and found that the minimum score point among the 1346 available SPAM mails was 1. Hence, we defined our fitness function as

\[ F = \begin{cases} 1 & \text{score point} \ge 1 \\ 0 & \text{score point} < 1 \end{cases} \]

In the earlier work [5], a minimum score point of 3 is considered for the classification of SPAM and HAM mails. GA based methods only look for the words available in the data dictionary when classifying spam.

In general, looking for a single word in arriving e-mails and deciding on the basis of that word whether a mail is SPAM or HAM will produce a lot of errors. Hence, it is necessary to consider the words, their frequency and the total number of words in an e-mail for more accurate classification of mails. In a GA, words can be added to or deleted from the data dictionary, so the method is adaptive in nature. In our experiment we considered a database of 2448 e-mails, of which 1346 are SPAM mails and the remaining 1102 are HAM mails [7]. The data dictionary contains 421 words, which are further sub-divided into seven categories C1-C7.
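The operators of Section 3 and the score-point rule above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the authors' implementation: chromosomes are taken to be equal-length bit strings, a gene "match" is assumed to mean that both category weights are non-zero and differ by at most `tolerance`, and all function names and the `tolerance` parameter are hypothetical.

```python
import random


def roulette_wheel_select(population, fitnesses, rng):
    """Return one individual, chosen with probability proportional to fitness."""
    total = sum(fitnesses)
    if total <= 0:
        # Assumption: fall back to a uniform choice when every fitness is zero.
        return rng.choice(population)
    pick = rng.random() * total
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if pick < running:
            return individual
    return population[-1]  # guard against floating-point round-off


def one_point_crossover(parent_a, parent_b, rng):
    """Keep the bits before a random cut point and swap the bits after it."""
    point = rng.randint(1, len(parent_a) - 1)
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


def mutate(chromosome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    flipped = {"0": "1", "1": "0"}
    return "".join(flipped[b] if rng.random() < rate else b for b in chromosome)


def matched_genes(test_weights, prototype_weights, tolerance=0.0):
    """Count categories whose weights agree (assumed, hypothetical match rule)."""
    return sum(
        1
        for t, p in zip(test_weights, prototype_weights)
        if t > 0 and p > 0 and abs(t - p) <= tolerance
    )


def score_points(test_weights, prototypes, tolerance=0.0):
    """Give one point per spam prototype that shares at least one matched gene."""
    return sum(
        1 for proto in prototypes
        if matched_genes(test_weights, proto, tolerance) >= 1
    )


def fitness(score, threshold=1):
    """F = 1 (SPAM) when the score point reaches the threshold, else 0 (HAM)."""
    return 1 if score >= threshold else 0
```

With `threshold=1` the last function reproduces the fitness rule defined above, while `threshold=3` corresponds to the minimum score point used in the earlier work [5]. Passing a seeded `random.Random` instance as `rng` keeps runs repeatable.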
The data dictionary can be found in Appendix A. After normalization, the weights lie in the range 0.000 to 1.000. The weight of each gene is then encoded as a 10-bit binary number:

Binary 0000000000 represents weight 0.000
Binary 0000000001 represents weight 0.001
Binary 0000000010 represents weight 0.002
...
Binary 1111100111 represents weight 0.999
Binary 1111101000 represents weight 1.000

The procedure for calculating the weight of a word of a particular group is detailed below. As an example, suppose an e-mail contains four dictionary words, namely hotel, luxury, tax and transaction. Of these four words, tax and transaction belong to category C2, and hotel and luxury belong to category C5 (see Appendix A). Let us consider an e-mail with 567 words, of which 237 are occurrences of hotel, luxury, tax and transaction, with frequencies 84, 23, 97 and 33 respectively. These frequencies are deliberately chosen to be large to make sure that the considered mail is a spam mail, because the spam database is very small: it contains only 421 words.

The words extracted from the e-mail are first checked to see whether they belong to any category of the spam database. If a word in the e-mail matches a word in the spam data dictionary, then the probability of drawing that word from the e-mail, which is used as its weight, is given by the simple formula

\[ W_w = \frac{SWM}{TWM} \]

where
SWM: number of occurrences of the spam word in the e-mail
TWM: total number of words in the e-mail

The weight for the word 'hotel' is

\[ W_w = \frac{84}{567} = 0.1482 \]

Similarly, the weight of the word 'luxury' is 0.04056. The weight of a category is calculated by averaging the weights of its words; for example, the weight of category C5 is (0.1482 + 0.04056)/2 = 0.0944. The weights obtained for each word are tabulated in Table 1.
Table 1: Calculation of weights under the average weightage method

Group  Word         Frequency  Weight of word  Weight of group
C2     tax          97         0.1711          0.1147
C2     transaction  33         0.0582
C5     hotel        84         0.1482          0.0944
C5     luxury       23         0.04056

Figure 2 SPAM chromosomes prototype

As shown in Figure 2, each mail is encoded into a chromosome consisting of 70 bits, 10 bits for each category. The weight value is represented in hex number format. Once chromosomes have been constructed for all the mails, the genetic algorithm starts and crossover takes place. In each generation only 12% of the chromosomes are crossed. The mutation rate is only 3%. The weights of the words of each gene in the testing mail and the weights of the words of each gene in a spam mail prototype are compared to find the matched genes. If the number of matched genes is greater than or equal to one, then the spam mail prototype receives one score point. If the score points exceed a threshold, then the mail is considered a spam mail. The threshold can be adjusted manually to get appropriate results. It must be remembered that we have chosen the fitness function on the basis of our experimental results.

5. Results

Figure 3 Recognition Rate Vs. FPR.

In Figure 3, the recognition rate is plotted against the false positive ratio (FPR). At an FPR level of 0.001 the recognition rate is nearly 84%, which increases to 97.5% at an FPR level of 1. In our early results we found that the larger the number of words in a mail, the more accurately it can be classified. We have checked our algorithm on a large corpus of 2448 mails, of which 1346 were SPAM mails and the rest were HAM mails. Results on such a large e-mail corpus are used to obtain a more accurate picture of the classification and of the effectiveness of the GA.
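As a reading aid for these figures, the FPR and FNR defined in Section 2 can be computed from a list of true labels and a list of predicted labels. The sketch below follows the paper's definitions, in which both ratios are divided by the total number of e-mails (many texts instead divide by the number of ham and spam mails respectively). The label strings "spam" and "ham" and the function name are assumptions made for illustration.

```python
def error_ratios(true_labels, predicted_labels):
    """Return (FPR, FNR), each taken over the total number of e-mails."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    total = len(true_labels)
    pairs = list(zip(true_labels, predicted_labels))
    # Ham wrongly identified as spam.
    false_positives = sum(1 for t, p in pairs if t == "ham" and p == "spam")
    # Spam wrongly identified as ham.
    false_negatives = sum(1 for t, p in pairs if t == "spam" and p == "ham")
    return false_positives / total, false_negatives / total


# Hypothetical toy data: one ham flagged as spam, one spam missed.
fpr, fnr = error_ratios(
    ["ham", "ham", "spam", "spam"],
    ["spam", "ham", "ham", "spam"],
)
```

Sweeping the score-point threshold and recording these ratios at each setting is one way a curve such as Figure 3 could be traced, although the paper does not describe its exact procedure.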
In earlier experiments only 140 mails were considered, and it was stated that the running time of the GA is very large, so a larger mail corpus would take much longer. We therefore ran this experiment on a high-end machine to get a clearer and more accurate picture of the GA. In our experiments we found that nearly 84% of the mails are correctly classified by our method at an FPR level of 0.001, which increases to 97.5% at an FPR level of 1.

6. Conclusion

This paper discusses a genetic algorithm based e-mail spam classification method. The major advantage of the GA based approach is that it is adaptive in nature. This technique can be applied on the user side; the recognition rate is 84% at an FPR of 0.001. As the recognition rate is not very high, this technique can be used along with another technique such as a neural network or a Bayesian classifier. The results of the two techniques can be combined using fuzzy logic to get a higher recognition rate.

7. References

[1] Enrico Blanzieri and Anton Bryl, "A Survey of Learning Based Techniques of Email Spam Filtering," Conference on Email and Anti-Spam, 2008.
[2] K.S. Tang et al., "Genetic Algorithm and Their Applications," IEEE Signal Processing Magazine, pp. 22-37, Nov. 1996.
[3] Usarat Sanpakdee et al., "Adaptive Spam Mail Filtering Using Genetic Algorithm," ICACT 2006.
[4] Yang J. and Honavar V., "Feature subset selection using a genetic algorithm," Intelligent Systems and their Applications, IEEE, Vol. 13, No. 2, pp. 44-49, 1998.
[5] Shrivastava, J. N., and Bindu, M. H., "E-mail Spam Filtering Using Adaptive Genetic Algorithm," International Journal of Intelligent Systems & Applications, Vol. 6, No. 2, pp. 54-60, 2014.
[6] Karimpour, J., A.A. Noroozi, and A. Abadi, "The Impact of Feature Selection on Web Spam detection," International Journal of Intelligent Systems and Applications (IJISA), Vol. 4, No. 9, p. 61, 2012.
[7] Spam Assassin, http://spamassassin.org.
Appendix A

Group, content, and example of keywords in each group

C1 Adult:
adult, aphrodisiac, big, cam, climax, company, cum, desire, erotic, fantasy, fuck, gay, girl, great, guy, hard, hardcore, heaven, hot, huge, long, man, max, max length, nude, orgasm, penis, performance, pheromone, pill, porn, powerful, pussy, satisfy, sex, stamina, sweet, teen, Viagra, webcam, x, xxx, xxx-porn, young, love, teen, anus

C2 Financial:
account, accountant, alert, analyst, attorney, bank, bankruptcy, benefit, bill, billing, broker, budget, building, cash, cheque, commission, consolidate, court, credit, creditor, currency, customer, debt, deposit, discover, economy, entrepreneur, estate, exchange, fee, finance, freedom, fund, help, highrisk, insurance, invest, investor, judgment, legal, legitimate, lender, loan, mastercard, mortgage, obligate, pay, payable, payable, paycheck, promote, purchase, rate, refinance, refund, rent, revenue, risk, service, statement, stock, support, tax, transaction, vat, visa, wealth, worth, service

C3 Commercial:
college, commerce, computer, cost, deliver, discount, especial, expensive, express, fantastic, free, furnishing, furniture, game, get, gif, gift, great, guarantee, inexpensive, invite, item, just, keyboard, license, lifetime, magazine, maintenance, mall, market, material, materials, mobile, motherboard, mouse, offer, online, only, order, palm, pamphlet, percent, premium, price, produce, product, program, recommend, refill, release, resell, reseller, retail, sale, save, save, sell, ship, shipping, shop, shopping, special, subscribe, supply, surprise, trade, trademark, upgrade, voucher, whole, wholesale, within

C4 Beauty and Diet:
after, age, amaze, anti-aging, appetite, beauty, become, before, believe, blood, body, botanic, breast, build, burn, calorie, capsule, card, cell, change, chemical, cholesterol, confirm, course, diet, difference, dose, drug, effect, effective, eliminate, energy, enhance, exercise, eye, face, fast, fat, firm, fit, fitness, flexible, gary, grow, grown, growth, hair, health, healthcare, heart, height, herb, herbal, hormone, improve, inche, incredible, kidney, large, laser, life-changing, light, lose, loss, low, magic, medicine, metabolism, micro-cap, miracle, modem, move, muscle, nature, nutrient, old, over, overweight, permanent, plain, potential, pound, power, protect, reduce, remanufacture, repair, restore, retain, reverse, safe, satisfaction, secret, size, step, strength, strong, tablet, therapy, thin, toxin, treatment, under, virginia, vitamin, weight, woman, wonderful, wrinkle

C5 Traveling:
book, deluxe, excite, guide, holiday, honest, hotel, luxury, meal, package, plan, problem, relax, relief, reserve, resort, summer, temple, ticket, tour, train, travel, traveler, trip, vacation

C6 Home-Based Business:
address, astonishment, base, broadcast, bulk, business, comfort, connect, demo, domain, downline, download, earn, email, emailing, ethernet, facemail, fresh, home, homebased, homeworker, host, income, interest, international, internet, investigate, job, list, lucrative, mail, mailbox, mailer, mailing, make, marketing, message, million, money-making, opportunity, part-time, people, private, profit, reach, receive, recipient, require, re-register, return, server, software, subscriber, success, teach, unsubscribe, user, visit, website, work, work-at-home, worker, working

C7 Gambling:
action, award, bet, bonus, casino, challenge, extra, gambling, gold, hunt, lass, lucky, millionaire, player, poker, prize, reward, rich, vegas, win, lottery