Improving Spam Filtering by Training on Artificially Generated Text

DEPARTMENT OF COMPUTER SCIENCE
Improving Spam Filtering by Training on
Artificially Generated Text
Torgeir Sorvik
_______________________
A dissertation submitted to the University of Bristol in accordance with the
requirements of the degree of Master of Science in the Faculty of Engineering
September 2004 | CSMSC-04
Abstract
This dissertation investigates whether spam filtering can be improved by training the
spam filter on artificially generated text. In addition, ROC curves are introduced as an
excellent metric of performance in spam filtering. ROC analysis is used to show that a
Markovian spam filter outperforms a Bayesian spam filter, and it is also used extensively
to compare classifiers when training on artificially generated text.
In binary classification there is a strong correlation between classifier performance and
the amount of training data. By analysing the available training text and maintaining a
probabilistic model of it, additional text can be generated from the model. The generated
text is added to the original training text in a process referred to as interpolation. The
interpolation of training data is shown to improve the performance of spam filters based
on a naïve Bayesian classifier. The improvement is found to be due to a reduction in
overfitting and a less extreme mapping of spam probabilities. Additional experiments are
carried out to investigate whether spam filtering can be further improved by adding text
to the dataset which has been generated to be deliberately hard to classify. However, no
additional improvement is found compared to the basic text generator.
Acknowledgements
This dissertation is dedicated to my classmates of MS51 2003/2004, whose optimism and
assistance were a great help to me through the tough times. I also wish to thank my wife-to-be Hilde for her patience and support.
I would like to thank Tim Kovacs for supervising this project and guiding me in my
work. Great thanks also to Peter Flach for his input and guidance on ROC curves.
Declaration
This dissertation is submitted to the University of Bristol in accordance with the
requirements of the degree of Master of Science in the Faculty of Engineering. It has not
been submitted for any other degree or diploma of any examining body. Except where
specifically acknowledged, it is all the work of the Author.
__________________________
Torgeir Sorvik, September 2004
Table of contents
Abstract
Acknowledgements
Declaration
Table of contents
1 Motivation, aims and objectives
2 Introduction to spam filtering
   2.1 Email communication and the spam problem
   2.2 Why spam is bad
   2.3 History and development of spam filters
3 Measuring performance in spam filtering
   3.1 Cost and bias
   3.2 Common metrics
   3.3 ROC curves
   3.4 Tools used
4 Practical spam filtering
   4.1 Dataset
   4.2 Training of spam filter
   4.3 Bayesian spam filtering
   4.4 Different Algorithms
   4.5 Bayesian versus Markovian spam filters
5 Using artificially generated text to improve spam filtering
   5.1 Basic idea of interpolating
   5.2 Tools used
   5.3 Interpolating training data with artificially generated text
   5.4 Generating tougher text
6 Conclusion and further work
References
Appendix 1: Source code
Appendix 2: Poster
1 Motivation, aims and objectives
The inspiration to investigate the effect of training a spam filter on artificially
generated text was sparked by the paper "Co-Evolving Parasites Improve Simulated
Evolution as an Optimization Procedure" by W. D. Hillis (1990). Hillis improved a
process by training it on test cases that had been modified, using feedback from the
process, to make them harder to solve. This dissertation examines whether spam filters
can be trained using test cases that have been tailored to be hard, and what effect such
a procedure might have on spam filtering performance.
The aims and objectives of the project are:
• Primary Aim: investigate whether spam filtering can be improved by training a spam
filter on data generated for this specific purpose.
o Objective: construct a system consisting of a spam filter and a spam
generator, and measure the effect of introducing generated text as input to
the spam filter.
• Secondary Aim: investigate the effect of generating text using a
Markov model which has been altered to produce messages that are harder to
classify.
o Objective: modify the system in order to generate difficult text
based on feedback from the spam filter.
• Secondary Aim: introduce ROC analysis as a metric for measuring
spam filtering performance.
o Objective: use ROC curves to document experimental results
and optimise performance.
• Secondary Aim: compare the performance of Bayesian and Markovian
spam filters.
o Objective: measure whether the experiments carried out in the
primary aim have different effects on Bayesian and Markovian
spam filters.
The aims are discussed in the order which makes the report most readable. The
secondary aim of introducing ROC analysis is presented first, since the tools presented
in that discussion are used extensively when analysing the findings of the other aims.
The aim of comparing the performance of Bayesian and Markovian spam filters is
not a prerequisite for the discussion of the main aim, but some interesting differences
in behaviour are discovered and discussed.
These aims and objectives are the same as presented in the interim report, and all of
them have been satisfactorily carried out.
2 Introduction to spam filtering
This chapter is intended as an introduction to email communication, the spam problem,
and the solutions that have been attempted to prevent spam from finding its way into inboxes.
2.1 Email communication and the spam problem
Email communication has many obvious advantages compared to traditional postage.
This section will explore some of these and show why these advantages have attracted
unwanted attention threatening to reduce the usability of electronic mail.
This paper focuses on the phenomenon of unsolicited bulk email, also known as junk mail or spam. Spam will be defined, following (Spamhaus, 2004), as any email where:
1. the recipient's personal identity and context are irrelevant because the message
is equally applicable to many other potential recipients, and
2. the recipient has not verifiably granted deliberate, explicit, and still-revocable
permission for it to be sent
In this project, email that is not spam but 'good' mail will be called ‘legit mail’ or
'ham'.
Most people living in the western world are familiar with email communication either
from personal use or through common knowledge. This type of communication has
become a cheap, fast and reliable way of communicating. The effort it takes to
compose and send a message is very low and people are sending many more emails
than letters using regular post. As an example, in mid-2003 the servers in the
Microsoft Hotmail network received 3 billion (10^9) emails every day
(www.microsoft.com).
So what is an email message? Technically, email messages are semi-structured text
documents. An email message consists of a header and a body. The header serves
much of the same purpose as envelopes do in traditional mail; it contains information
about the sender and, most importantly, the receiver. In addition, the header carries
information about any intermediate stops the message has taken, timestamps for
sending and receiving, the email client used, and a subject. The header of a typical email
contains much information, an average of 1 KB per message in the data used for this
project. The body of an email can contain text either as plain text or HTML. In
addition, MIME objects can be inserted. A MIME (Multipurpose Internet Mail
Extensions) object can represent non-ASCII text, multipart message bodies,
multimedia and other documents. A much-used example of a MIME object is the JPEG
picture, which occurs frequently in emails.
As already mentioned, email messaging has several properties making it quick and
easy to use. The cost involved in transmitting one individual email message from
sender to receiver is so small that it is often considered to be free. Some simple
calculations will attempt to put numbers on these claims.
The approximate cost of an email message might be:
• Electrical power to a computer: say 300 W for five minutes with an
electricity fee of 10p/kWh, a cost of 0.25 pence.
• A connection to the internet. For an ADSL line costing £30/month used in 10
seconds, a cost of 0.014 pence
All in all, the estimated cost of creating and transmitting an email message,
which would typically be available to the receiver within seconds, is about 0.25 pence.
A similar calculation of the costs involved in composing and sending a traditional
letter shows:
• One sheet of paper from a WH Smith 150 sheet notepad: £1.99/150 = 1.33p.
• An A5 envelope from a WH Smith 25-pack: £1.99/25 = 8p.
• A first class stamp at the post office: 28 p.
• Ink from a £2-pen which lasts for 100 pages of text: 2p.
The estimated cost of sending a letter, which normally takes one or two days within a
country or about a week to some other country, is 39.33 pence.
This simple calculation demonstrates the advantages of email communication: it is
much faster and, according to the above calculations, 157 times cheaper (commercial
users pay less for postage).
If we assume that the sender wishes to send the same message to all her 100 cousins,
the costs involved in sending the message by email are not affected: one message is
uploaded and distributed to 100 recipients. If the same communication is to be
achieved through royal mail, 100 sheets of paper, 100 envelopes and 100 stamps
would be needed. At this volume, email communication would still cost 0.25 pence
while traditional post would cost £39.33, about 15,700 times more expensive.
The fact that email messages can be sent and received almost instantly by multiple
receivers has turned out to provide a great motivation for the misuse of email as a
tool.
On May 3, 1978, the world's first spam message was distributed (Spia, 2003). The
sender, Gary Thuerk, used Arpanet to market a new series of computers. The incident
was not well received. The spam we know today started in the mid 1990s when
commercial interests started to post messages in newsgroups. People also discovered
how easy it was to mass distribute a message to all the email addresses they could get
hold of.
Today, professional spammers offer their services to organisations willing to pay a fee
of $250 for the dispatch of one million messages (Yerazunis, 2004). Many
companies offer systems for gathering email addresses from the web, e.g. Lencom
Software Inc (www.lencom.com), and Spia (2003) estimates that a list of 25 million
email addresses can be bought for $25.
The amount of spam being transported through the Internet keeps growing.
Currently (August 2004), spam is reported to make up 65% of all email traffic; see
Figure 1, reproduced from (Symantec, 2004).
Figure 1, Historical spam amount
2.2 Why spam is bad
There are many reasons why most people do not want spam messages in their
inboxes. The main reason at an individual level is usually that spam messages sent to a
large number of people are unlikely to be of interest to most users. The spam is
simply an annoyance, forcing the user to delete many messages of no interest. From a
more objective point of view, the objections to spam include:
• Most spam contains fraudulent or inaccurate information.
• Deleting spam takes time. In a business situation, time equals money.
• Spam often carries offensive content, such as pornographic images or vulgar language.
• The large amount of data transferred in spam emailing requires extra bandwidth and storage capacity for users and ISPs.
• Important mail might be accidentally deleted when filtering spam (Heise, 2004).
• Spammers often forge email headers (spoofing), and anyone is at risk of having their email address shown as the sender of a spam message, with the risk of being blacklisted.

Figure 2, spam content
Figure 2 shows the distribution of different categories of spam measured in July 2004,
courtesy of (Symantec, 2004).
Spam affects individual persons, organisations and professionals. Though no
particular group can claim to suffer significantly more from spam than the others,
it is generally accepted that spam is becoming a particularly great problem for
businesses. The spam received by businesses inflicts considerable costs on those companies.
As already mentioned, time equals money in the professional world. Time is wasted
when employees use their time to remove spam, or when employees must receive
training in dealing with spam and in the operation of anti-spam software. The cost of
handling spam is estimated at $20 billion, growing 100% every year (Spia,
2003).
As the amount of spam email traffic has grown, causing greater and greater
inconvenience for users, more effort has been put into research and development of
automated software capable of filtering spam email from legit email. These software
solutions, called spam filters, are described in chapter 4.
2.3 History and development of spam filters
In the late 1990s, the world was starting to discover the Internet. Computers were
becoming powerful, sporting 32-bit processors and video cards displaying
millions of colours. At the same time, modems went from being low-baud fax hardware
to 28.8 kbps dial-up networking devices. Security was not an issue, and people
proudly posted their email addresses on various forums without giving it a second
thought. Early spammers easily compiled extensive databases of email addresses and
started the dispatch of primitive but effective messages.
Mail clients of the 1990s were also rather primitive and did not allow the user to
assign different actions to be executed based on the sender and content of a message;
there had been no need for such functionality.
Many early filtering solutions used simple Boolean queries to decide whether a
message was spam. These filters are referred to as rule-based; they typically
scanned the subject line for occurrences of 'sex' or 'rich', and the user was required to
manually define the test queries. Cohen (1996) suggested a system for
automatically learning rules to filter email. Rule-based learning is not considered to
perform satisfactorily today, but some spam filters, e.g. SpamAssassin, still
supplement other technologies with rule-based classification (SpamAssassin, 2004).
During the last decade, several algorithms for spam filtering have been examined. In
their comprehensive paper "Learning to Filter Unsolicited Commercial E-Mail",
Androutsopoulos et al. (2004) describe how algorithms such as C4.5, Ripper, k-nearest
neighbour and Support Vector Machines have been tested and compared to
each other and to naïve Bayes, which is described in more detail in the next chapter.
Androutsopoulos also co-authored a paper where performance in email filtering was
improved by stacking several classifiers and using their combined predictive power
to decide the most probable class (Sakkis et al., 2001). Stacking several classifiers has
the attractive property that the stacked algorithms often have significantly
different biases, resulting in uncorrelated errors and an overall more robust classifier.
Katirai (1999) used his MSc thesis to compare the performance of naïve Bayes and
Genetic Programming. His finding was that GP produced slightly worse results but
with a shorter processing time. However, Katirai admits that the implementation of
naïve Bayes used was poor and inefficient.
The continuous development of new and improved spam filters is made necessary by
the fact that the people sending spam, the spammers, are constantly altering the
composition and style of the spam they send in an attempt to bypass spam filters. This
development has led to an arms race between spammers and spam fighters. The
changing style of spam often means that old spam filters are not tuned for the kind of
spam currently arriving.
When comparing older and current literature on spam filtering, it is possible to
identify several features of spam which have changed. Sahami et al. (1998) state that
spam messages usually don't have attachments, that spam messages are typically sent at
night time, and that spam messages are hardly ever sent from an .edu, .uk, .no or other
similar domain. Today, very many spam messages carry binary data, either as an
attachment or as a MIME part of the body. Also, very many spam messages received
today have falsified headers showing innocent people, often based at the same domain
as the receiver, as the sender. This forging of email headers is called "spoofing".
Sahami also reports with despair on an alarming situation where spam email constitutes
more than 20% of the content of a typical inbox. Considering the 65% figure this
report operates with, it seems Sahami was better off not knowing what he had coming.
The above summary goes to show that the story of spam filtering has not yet been
fully written, and that no stone is being left unturned in the attempt to tackle spam by the
use of automated software.
3 Measuring performance in spam filtering
In this chapter, some metrics for measuring the behaviour of spam filters are
discussed. The well-known and much-used accuracy metric is demonstrated to have
limitations when measuring performance in spam filtering. As a more powerful tool of
analysis, ROC curves are introduced and explained.
3.1 Cost and bias
Spam filtering is a text classification problem with some properties that make it very
hard. Tom Fawcett summarised the core issues in spam filtering as follows (Fawcett, 2003a,
p. 1):
"skewed and changing class distributions;
unequal and uncertain error costs;
complex text patterns;
a complex, disjunctive and drifting target concept;
and challenges of intelligent, adaptive adversaries"
As already mentioned in section 2.2, the cost of deleting a legit message is much
greater than that of allowing a spam message to be displayed to the user. However, not even
this is a universal truth; if the inbox in question is on a family computer or a computer
with very young users, some users might be willing to lose a legit message rather than
have very inappropriate content displayed. All in all, a spam filter is a tool which
must be customised to individual users according to the costs they assign to the two
types of error.
A distinct feature of spam is the class skew. The amount of received mail and the
spam/ham ratio varies a great deal from season to season, from day to day, and even
across the hours of the day. In his paper from 1998, Sahami reports that spam
messages are mostly dispatched at night time (US time).
Experiments performed in this project show that the class skew is a strong bias,
especially in the earlier stages of learning. If the training set has a greater portion of one
class, the predictions are biased towards this class. The cause of this phenomenon is
explained by two observations. First, some implementations of Bayesian spam filters
incorporate the prior probability (the spam/ham ratio) in the calculation of the most likely
class. Second, every occurrence of a token increments a counter for the originating class,
so a larger amount of data from one class causes its counters to be incremented more,
forcing the ratios up.
The materials presented in this dissertation will show how ROC curves can be used as
a measurement when adjusting spam filters for an optimal performance given their
class skew and bias.
3.2 Common metrics
In binary classification, the goal is to predict the correct class of a test instance. In any
prediction made by a classifier there are four possible outcomes:
1. The instance is of class POS and the prediction is POS, called a True Positive.
2. The instance is of class POS and the prediction is NEG, called a False
Negative.
3. The instance is of class NEG and the prediction is POS, called a False
Positive.
4. The instance is of class NEG and the prediction is NEG, called a True
Negative.
Spam filtering is normally a binary classification problem for detecting spam. Hence,
a spam message is a positive instance and a legitimate message is a negative instance.
A confusion matrix (contingency table) for binary classification is shown in Figure 3.
In Machine Learning and Data Mining it is common to derive some measures from
the confusion matrix such as:
Accuracy = (TP + TN) / (P + N)
Recall = TP / (TP + FN)
Precision = TP / (TP + FP)

                  Assumed class
                    p        n
True class    P     TP       FN
              N     FP       TN

Figure 3, confusion matrix for binary classification
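To make the relationship between the confusion-matrix counts and these metrics concrete, a minimal Java sketch is shown below. The class and method names are illustrative only and are not part of the project's source code:

    /** Minimal sketch: deriving the common metrics from confusion-matrix counts. */
    class ConfusionMatrixSketch {
        final double tp, fn, fp, tn;   // the four possible outcomes of binary classification

        ConfusionMatrixSketch(double tp, double fn, double fp, double tn) {
            this.tp = tp; this.fn = fn; this.fp = fp; this.tn = tn;
        }

        double accuracy()  { return (tp + tn) / (tp + fn + fp + tn); }  // (TP+TN)/(P+N)
        double error()     { return 1.0 - accuracy(); }                 // 1 - Accuracy
        double recall()    { return tp / (tp + fn); }                   // fraction of positives found
        double precision() { return tp / (tp + fp); }                   // fraction of positive predictions that are correct
    }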
Normally, when discussing the performance of any classification task, we are mainly
interested in how often the classifier chooses the correct option. The metric which
measures the correct/not-correct ratio is Accuracy. To measure the amount of
misclassification we use Error, which is 1 − Accuracy. Accuracy is a simple and much-used
concept, well suited to many tasks. However, in tasks with unequal costs,
Accuracy does not give a realistic measurement of performance. This is best shown
with an example:
On a dataset with an equal amount of ham and spam, if the cost of deleting one
legit message is 50 times higher than the cost of not deleting a spam message, then a
classifier which displays 90% of legit messages and 0% of spam messages and a
classifier which displays 100% of legit messages and 10% of spam messages are
assigned the same accuracy, 95%.
Since a cost-ratio of 50 means that the user prefers to read 50 spam messages rather
than having one legit message deleted, we see that the first of the two classifiers
performs badly but is still assigned a high accuracy. Some evidence suggests that
accuracy is not a suitable metric for classification performance when error costs are
skewed (Provost et al., 1998).
One attempt to give a fair measurement of performance with skewed costs is to use
Weighted Accuracy (Androutsopoulos et al., 2000b) which is defined as:
$$ WAcc = \frac{\lambda \cdot n_{legit \to legit} + n_{spam \to spam}}{\lambda \cdot N_{legit} + N_{spam}} = \frac{\lambda \cdot TP + TN}{\lambda \cdot P + N} $$

where λ is the FP/FN weight ratio, λ ∈ (0, ∞).
The Weighted Accuracies from the example above are 91.66% for the classifier which
displays 90% of legit messages and 0% of spam messages, and 98.33% for the classifier
which displays 100% of legit messages and 10% of spam messages. These calculations
suggest that the first classifier performs much worse than the other given the
current cost ratio. If we calculate WErr = 1 − WAcc, we see that the first classifier
has a five times greater error than the second.
The metrics of Precision and Recall are commonly used in information retrieval
(Lewis, 1991; Fisher et al., 2004). Whilst accuracy is calculated from both types of
errors, recall and precision concentrate on measuring true positives. This ability is
useful when the costs of errors are skewed. Precision measures the ratio of correct
positive predictions, while recall measures the fraction of positive instances classified
correctly. By maximising precision, the focus of the classification task becomes to
minimise the number of false positives. By maximising recall, the focus turns to
minimising the number of false negatives. Precision and recall are much used in the spam
filtering literature. In spam filtering, a user must decide on error costs and determine a
suitable trade-off between maximising precision and recall. Figure 4, reproduced from
(Fawcett, 2003b), shows a precision-recall graph for two classifiers on a test set with
equal skew.
Figure 4 precision-recall graph
3.3 ROC curves
In an attempt to overcome some of the weaknesses of the metrics discussed above, we
now introduce ROC curves. ROC (Receiver Operating Characteristics) curves
originate from signal detection theory where scientists needed to depict the tradeoffs
made between accepting a signal as valid data and discarding it as noise.
A ROC curve is a set of points with coordinates [FP,TP]. ROC curves can also be
drawn in n-dimensional space but this is not relevant for this discussion.
ROC curves have received much attention in the machine learning community lately
due to their ability to measure classification performance without being affected by
cost skew and class skew in the way the metrics discussed in section 3.2 are (Furnkranz and
Flach, 2003). However, there is still some discussion as to whether or not ROC curves are
the best tool for classification with skewed error costs (Drummond and Holte, 2004).
As will be shown in equation (4) in section 4.3, the naïve Bayesian approach to binary classification is
to estimate the probability of both classes and choose the class scoring highest. Naïve
Bayes is a "ranker"; it assigns a continuous score to an instance through the ratio of the
predicted probabilities of spam and ham. To turn a ranker into a classifier, a decision
threshold is set. Any instance receiving a score greater than the decision threshold is
classified as a positive instance. When drawing the ROC curve of a ranker, as in
figure 5a, we sort the instances by their score and traverse the list, drawing one point
for each instance. The ROC graph will then show all possible classifiers that can be
created by choosing the decision threshold to be one of the scores.
Figure 5a shows the ROC space with a ROC curve drawn through four points in the
space. The four points represent different decision thresholds in a naïve Bayesian
classifier.
Figure 5a, ROC space (a ROC curve through points corresponding to four decision thresholds)
Figure 5b, Optimal classifier (iso-accuracy lines with slopes α, β and γ in ROC space)
The diagonal line X=Y represents a behaviour where classification is random. The
point (0,1) in the top left corner of ROC space is called ROC heaven and any
classifier in this point is capable of separating the classes perfectly. At the other
extreme, in the point (1,0) we find ROC hell which always chooses the wrong class.
Since a classifier in ROC hell always chooses wrongly, the predictions can simply be
negated to create a classifier in ROC heaven; there is symmetry around X=Y and all
classifiers should in practice lie on the heaven side of X=Y.
Classifiers in the lower left of ROC space are said to be Conservative (Fawcett,
2003b); they require much evidence to make a positive prediction, leading to few
false positives. On the other hand, this leads to few true positives as well. Classifiers
in the top right corner are said to be Liberal; they require less evidence for labelling
an instance as positive, leading to a higher FP rate. Spam filtering is typically a
conservative task since the cost of a false positive is much greater than that of a false
negative.
Since the ROC curve of a spam filter contains information about the expected
errors for any classifier, it is possible to choose the Optimal Classifier for any cost
situation. The example in section 3.2 demonstrated how we might want to assign a
cost to false positives 50 times greater than to false negatives. To find the optimal
classifier given these costs, a line of slope 1/50 (inverted) is slid from ROC heaven
towards ROC hell until it touches a point on the curve; see figure 5b. This point
represents the optimal classifier given the specified costs. Such lines are called
iso-accuracy lines and connect ROC points with the same accuracy. In spam
filtering, we wish to minimise the number of false positives and will therefore
typically use a very steep iso-accuracy line. In figure 5b such a slope is demonstrated
as a line with angle γ. In another setting, the cost ratio between FP and FN might be
smaller, demonstrated by a line with angle β. If the main goal of the classification were
to avoid FN errors, a line with angle α might be the best option.
As demonstrated in section 3.2, not all metrics give a fair evaluation of how well a
classifier separates classes. A naïve Bayesian spam filter is in effect an indefinite
number of classifiers, because any chosen threshold turns the ranker into a classifier.
By measuring the Area Under the ROC Curve (AUC) we can express the filter's ability
to separate classes for any applicable decision threshold. Formally, the AUC of a
classifier is equal to the probability that the classifier will assign a higher rank to a
randomly chosen positive instance than to a randomly chosen negative instance (Fawcett,
2003b). The maximum obtainable AUC is 1, the area of the unit square.
AUC = 1 is found for spam filters which separate the classes perfectly. The
expected AUC of a random classifier is 0.5.
Figure 6a and 6b illustrate how the AUC of a naïve Bayes spam filter might look
when the filter has received: a) little training and b) more training.
In figure 6b, the ROC graph is closer to the wall of the ROC space in most points;
therefore the shaded area is greater than that of figure 6a. When choosing an optimal
classifier from the two graphs, it is likely that the classifier selected in figure 6b will
perform better than that of figure 6a.
Figure 6a, AUC after little training
Figure 6b, AUC after more training
When performing cross-validated experiments it is possible to draw averaged ROC
curves as described by (Fawcett, 2003b). However, the experiments carried out in this
dissertation typically have different test sets between runs, and the difference in class
skew excludes averaging techniques based on fixed error intervals. Instead, ROC curves
of single but representative runs are used and discussed based on the general trends
observed.
For the remainder of this report it is assumed that a spam filter with a higher AUC
than another is in general preferable. This assumption is based on an averaged score, to
avoid good classifiers being discarded due to a lower AUC caused by concavities. The
issue of concavities in ROC graphs (Flach and Wu, 2003) and the convex hull in
general (Flach, 2004; Fawcett, 2003b) will not be discussed in this report.
3.4 Tools used
In order to perform ROC analysis on spam filters, some tools for storing, measuring
and drawing ROC graphs have been implemented in ROCGraph.java.
Each ROC point is represented as an object, implemented in an inner class called
ROCPoint, containing its true class, f-score and an identifier. The choice of representing
points as objects comes naturally since the points need to be sorted before any
operations can be performed on the graph.
Three basic functionalities have been implemented:
(1) drawROCGraph():
Displays a panel containing a ROC graph using ROCOn, available from
http://www.cs.bris.ac.uk/Research/MachineLearning/rocon/index.html.
The calculations are based on Algorithm 2 in (Fawcett, 2003b) with a few
modifications:
    Sort the points in the set by decreasing f-scores
    Set N, P = number of negative/positive examples
    Set FP, TP = 0
    roc = new ROCOn object
    For all points i1, i2, ..., ik in the set:
        if (f_i ≠ FPrev)
            roc.addPoint(FP*100/N, TP*100/P)
            FPrev = f_i
        if (i is a positive example)
            TP++
        if (i is a negative example)
            FP++
    roc.show()
The figures used in this report are not generated by the drawROCGraph function, but
produced in a spreadsheet application using the TP and FP counts as coordinates. The
coordinates are printed by the getOptimalClassifier function.
(2) getAUC():
Calculates the area under the curve using approximation by trapezoids.
The calculations are based on Algorithm 3 in (Fawcett, 2003b) with a few
modifications:
    Sort the points in the set by decreasing f-scores
    Set N, P = number of negative/positive examples
    Set Area, FP, TP, FPprev, TPprev = 0
    For all points i1, i2, ..., ik in the set:
        if (f_i ≠ f_prev)
            Area += Trapezoid(FP, FPprev, TP, TPprev)
            f_prev = f_i; FPprev = FP; TPprev = TP
        if (i is a positive example)
            TP++
        if (i is a negative example)
            FP++
    Area += Trapezoid(N, FPprev, P, TPprev)
    Area /= P*N
    Return Area
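As a rough illustration, the trapezoid-based AUC calculation might be realised in Java along the following lines. This is a simplified sketch rather than the actual contents of ROCGraph.java; the RocPoint class below mirrors the ROCPoint description above but is an assumption of this illustration:

    import java.util.Arrays;
    import java.util.Comparator;

    /** Simplified sketch of trapezoidal AUC estimation over scored instances. */
    class RocPoint {
        final boolean positive; // true class: spam (positive) or ham (negative)
        final double score;     // f-score assigned by the ranker
        RocPoint(boolean positive, double score) { this.positive = positive; this.score = score; }
    }

    class AucSketch {
        static double auc(RocPoint[] points) {
            Arrays.sort(points, Comparator.comparingDouble((RocPoint r) -> r.score).reversed());
            long p = Arrays.stream(points).filter(r -> r.positive).count();
            long n = points.length - p;
            double area = 0, tp = 0, fp = 0, tpPrev = 0, fpPrev = 0;
            double scorePrev = Double.POSITIVE_INFINITY;
            for (RocPoint r : points) {
                if (r.score != scorePrev) {               // only close a trapezoid when the score changes
                    area += trapezoid(fp, fpPrev, tp, tpPrev);
                    scorePrev = r.score; fpPrev = fp; tpPrev = tp;
                }
                if (r.positive) tp++; else fp++;
            }
            area += trapezoid(n, fpPrev, p, tpPrev);      // close the curve at (N, P)
            return area / (p * n);                        // normalise to the unit square
        }

        /** Area of the trapezoid between two consecutive ROC points (in raw counts). */
        static double trapezoid(double x2, double x1, double y2, double y1) {
            return Math.abs(x2 - x1) * (y2 + y1) / 2.0;
        }
    }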
(3) getOptimalClassifier(slope, direction):
As demonstrated in figure 5b, the optimal classifier for any given cost ratio can be
found through geometric analysis. The function has two parameters:
• The slope of the iso-accuracy line. The slope is represented as X/Y to allow it
to reach a vertical angle when X/Y = 0.
• Direction. The function allows traversal in both directions: from (0,0)
towards (1,1), called Liberal, and from (1,1) towards (0,0), called
Conservative.
The choice of names for the directions comes from the observation that the algorithm
replaces the best known point only if another point is strictly better. In the case where several
points lie on the same iso-accuracy line, the first point encountered will be returned.
By traversing towards (1,1), the most liberal point will be chosen; by traversing
towards (0,0), the most conservative point is returned.
The algorithm for finding the optimal classifier (only the liberal direction is shown for
simplicity):

    Direction and Slope received from method call
    Sort the points in the set by decreasing f-scores
    Set P = number of positive examples
    Set FP, TP, OptimalFP, OptimalTP = 0
    Set BestSoFar = ∞
    For all points i1, i2, ..., ik in the set:
        Goodness = FP + (P - TP) * slope
        if (Goodness < BestSoFar)
            BestSoFar = Goodness
            OptimalFP = FP; OptimalTP = TP
        if (i is a positive example)
            TP++
        if (i is a negative example)
            FP++
    Return (OptimalFP, P - OptimalTP)
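A corresponding sketch of the iso-accuracy search (liberal direction only), reusing the RocPoint class from the AUC sketch above, might look as follows. Again this is an illustration rather than the project's exact code:

    import java.util.Arrays;
    import java.util.Comparator;

    /** Simplified sketch: slide an iso-accuracy line of the given slope over the ROC points. */
    class OptimalClassifierSketch {
        /** Returns {FP, FN} counts of the best threshold for the given cost slope. */
        static double[] optimalClassifier(RocPoint[] points, double slope) {
            Arrays.sort(points, Comparator.comparingDouble((RocPoint r) -> r.score).reversed());
            long p = Arrays.stream(points).filter(r -> r.positive).count();
            double tp = 0, fp = 0, optimalTp = 0, optimalFp = 0;
            double bestSoFar = Double.POSITIVE_INFINITY;    // any real point improves on this
            for (RocPoint r : points) {
                double goodness = fp + (p - tp) * slope;    // FP errors plus weighted FN errors
                if (goodness < bestSoFar) {                 // strict improvement: first point on a tied line is kept
                    bestSoFar = goodness;
                    optimalFp = fp; optimalTp = tp;
                }
                if (r.positive) tp++; else fp++;
            }
            return new double[]{optimalFp, p - optimalTp};  // report FP and FN counts
        }
    }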
4 Practical spam filtering
In this chapter, different aspects of practical spam filtering are presented and
discussed. The algorithms presented form the basis for all experiments in this project.
4.1 Dataset
Spam filtering is a machine learning/data mining task where information extracted
from a data set is used to classify unseen documents. Email datasets contain
information which can reveal a lot about the persons implicated in the messages. Due to
the sensitivity of personal email communication, very few datasets containing email
are available. However, some datasets based on voluntary submissions from the public
are available. These data sets are not subject to any restrictions, even though they
might contain some sensitive data. The disadvantage of datasets based on submissions
is that they consist of messages written to and from a large number of different people
and are therefore not a realistic approximation to a personal inbox.
Some datasets contain email messages which have been anonymised through
encoding (Androutsopoulos et al., 2000a).
Due to the above-mentioned limitations of publicly available datasets, a new personal
dataset has been compiled for this project. The data consists of email received on
accounts owned by the author. The mail has been processed by Ximian Evolution and
exported in text format. Unfortunately (or rather fortunately), spam and ham are not
received on the same account. Ham email is received on a University email account,
while another account residing at Yahoo receives mainly spam. The main differences
between messages from the two accounts are a few entries in the headers. Creators of
email have great freedom in what to put in headers according to the SMTP protocol,
and messages received through the Yahoo POP3 server and the UoB IMAP server
contain a few custom additions which act as instant giveaways.
For this reason, a program has been implemented to create datasets from text corpora.
The program MailTokeniser.java takes as input a text corpus and its classification.
The pre-processing of the messages consists of the following steps (a rough code sketch of such a pipeline follows the list):
1. messages are separated
2. distinctive header information is deleted
3. MIME attachments are removed
4. the messages are assigned a name and added to a collection.
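A rough sketch of such a preprocessing pipeline is shown below. The regular expressions, header names and class names are illustrative assumptions only and do not reproduce the actual MailTokeniser.java:

    import java.util.ArrayList;
    import java.util.List;

    /** Illustrative sketch of creating a dataset from a text corpus, following the four steps above. */
    class CorpusPreprocessorSketch {

        static List<String> preprocess(String corpus, String classLabel) {
            List<String> dataset = new ArrayList<>();
            // 1. messages are separated (assuming an mbox-style export where each message starts with "From ")
            String[] messages = corpus.split("(?m)^From ");
            int sequence = 0;
            for (String message : messages) {
                if (message.isBlank()) continue;
                // 2. distinctive header lines are deleted (illustrative choice of headers)
                message = message.replaceAll("(?m)^(Received|Delivered-To|X-[A-Za-z-]+):.*$", "");
                // 3. MIME attachments are removed (crude: drop everything after the first MIME boundary)
                int boundary = message.indexOf("\n--");
                if (boundary >= 0) message = message.substring(0, boundary);
                // 4. the message is assigned a name that preserves chronological order
                dataset.add(classLabel + String.format("%04d", sequence++) + "\n" + message);
            }
            return dataset;
        }
    }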
The resulting dataset consists of 750 ham messages and 850 spam messages. The
messages are assigned names according to the order in which they arrived, to preserve
the chronological ordering.
Due to the privacy of the large number of senders and receivers included in the ham
messages, the data set will not be made publicly available. However, the dataset can be
presented on request by the author in person.
4.2 Training of spam filter
The process of training a spam filter consists of tokenising the training data and
maintaining two counters for each token: the number of times the token has occurred
in spam messages and in ham messages. In the experiments performed in this project, the number
of different tokens occurring in the corpus is very high and convergence occurs late.
The growth of the vocabulary size when training on 800 spam messages is shown in
figure 7.
Figure 7, growth of vocabulary (number of distinct tokens against number of spam messages trained on)
Surprisingly, the growth of the vocabulary is almost linear. Intuitively, one would
expect there to be a lot of redundant words shared by many messages. The
explanation for this behaviour is related to an observation made in section 2.1: a
typical email message can contain 1000 bytes of header data and only 100 bytes of
body data. Header information contains a lot of unique information, and in the
vocabulary a large number of mappings point to tokens occurring only once. In
addition, spam messages often contain random text, and sometimes invalid words, in
a dictionary salad intended to confuse Bayesian spam filters. If feature pre-processing
techniques such as stemming had been applied, the growth of the vocabulary might have
converged after a shorter period.
Several spam filtering algorithms use only a subset of all possible tokens to classify
unseen data; the mapping from features to counts consists only of a set of tokens
chosen in a feature selection process. Forman (2004) claims:
"Feature selection is necessary to make the problem tractable for a
classifier. Well-chosen features can improve substantially the classification
accuracy, or equivalently, reduce the amount of training data needed to
obtain a desired level of performance. It has been found that selecting
features separately for each class, versus all together, extends the reach of
induction algorithms to greater problem sizes having greater levels of class
skew" p. 4.
Several spam filter models include a feature selection process in the training phase
(Sahami et al., 1998; Androutsopoulos et al., 2000b). Feature selection is the process
of measuring the effectiveness (information gain / mutual information) of an attribute
in classifying the training data (Cover and Thomas, 1991). Typically, a subset of at most
500 of the best features is used in classification. Others have suggested that it is
preferable to maintain a mapping of as many attributes as possible, to ensure that any
token found in a message will be assigned a probability (Graham, 2003).
In the work done during this dissertation, the best results have been achieved using
Graham's approach. Also, a real spam filter does not have a fixed training set, but
uses whatever mail is received to classify new mail. To use feature selection in this
environment, the information gain of the entire vocabulary would need to be
recalculated every time new training data was received.
The program IG.java implements feature selection as described by (Sahami et al.,
1998).
To assess the discriminating ability of features, their Mutual Information is
calculated:

$$ MI(X;C) = \sum_{x \in \{0,1\},\; c \in \{spam,\, ham\}} P(X=x, C=c) \cdot \log \frac{P(X=x, C=c)}{P(X=x) \cdot P(C=c)} $$
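As an illustration, the mutual information of a single token could be estimated from token/class document counts along the following lines. This is a rough sketch; the class and variable names are assumptions, not taken from IG.java:

    /** Sketch: mutual information MI(X;C) of one token, estimated from document counts. */
    class MutualInformationSketch {
        /**
         * spamWith  - number of spam messages containing the token
         * spamTotal - total number of spam messages
         * hamWith   - number of ham messages containing the token
         * hamTotal  - total number of ham messages
         */
        static double mutualInformation(int spamWith, int spamTotal, int hamWith, int hamTotal) {
            double n = spamTotal + hamTotal;
            // joint[x][c]: x is token present/absent, c is spam/ham
            double[][] joint = {
                { spamWith / n, hamWith / n },
                { (spamTotal - spamWith) / n, (hamTotal - hamWith) / n }
            };
            double[] pClass = { spamTotal / n, hamTotal / n };
            double mi = 0;
            for (int x = 0; x < 2; x++) {
                double pX = joint[x][0] + joint[x][1];
                for (int c = 0; c < 2; c++) {
                    double pXC = joint[x][c];
                    if (pXC > 0) mi += pXC * Math.log(pXC / (pX * pClass[c]));  // skip zero-probability cells
                }
            }
            return mi;
        }
    }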
When spam filtering was performed using IG.java with the vocabulary reduced to, e.g.,
1000 or 500 features, the performance of the spam filter was reduced from
0.994 to 0.987 and 0.986 respectively. Most of the performance loss was caused by an
increase in false negatives, which makes the method inappropriate for spam filtering.
Feature reduction did not improve the performance of spam filtering, but will be
revisited in section 5.3 (training on semi-random text), where it is used to identify
high-entropy features.
As already mentioned, a real-life spam filter does not have a static training set and test
set. In (Mitchell, 1997), a much-cited reference for text classification, it is
assumed that a training set and a test set are presented. It is also assumed that the
training set and the test set are drawn from the same text corpus and shuffled in a
ten-fold cross validation.
Firstly, real-life spam filtering is an on-line learning task; only one attempt is given at
correctly classifying incoming messages. Secondly, it has been shown that the
chronological order of the email messages should be respected to achieve optimal
performance in spam filtering experiments (Fu and Silver, 2004). These observations
suggest that extra care should be taken when planning experiments using cross
validation.
There are several ways to train a spam filter, each with a different resulting
performance:
1. Fixed training set, no information is added after initial training.
2. Iterative training set, all documents are added to mapping.
3. Iterative training set, erroneously classified messages are added to mapping.
4. Train until no errors, training is repeated until perfect result.
Figure 8 shows the relationship between training set and test set in filters with fixed
training set and iterative training set.
Figure 8, training schemes (fixed training set versus iterative training set over time)
In practical spam filtering, training until no errors is reported to produce the best results,
but at a high computational cost (Yerazunis, 2004). In the experiments performed for this
project, (1), (2) and (3) were tested, with (3) producing the best results.
Classification using a fixed training set (1) is not well suited to an online learning task
such as spam filtering, but it is very useful when measuring how good a spam filter is at a
given time without modifying the mapping further. In all experiments carried out in
chapter 5, a fixed training set is used.
4.3 Bayesian spam filtering
Since the second half of the 1990s, naïve Bayes has been recognised as a powerful spam
filter, and most other classifiers have been tested using naïve Bayes as a baseline for
comparison.
In Bayesian text classification, the idea is to decompose text into independent words
which are counted according to their origin. When new text is to be classified, the
previous appearances during training of each word in the text are used to assign a
likelihood of the word belonging to a document of a given class. By combining the
predictions of each word in a document, the most probable class can be calculated.
The naïve Bayesian spam filter gets its name from the way it calculates probabilities
using Bayes theorem:
$$ P(Class \mid Message) = \frac{P(Class) \cdot P(Message \mid Class)}{P(Message)} \qquad (1) $$
where P(Class|Message) is the probability of Message, a vector of tokens v1, ..., vk,
belonging to Class; P(Class) is the prior probability of a message being of class Class,
calculated from the ratio of spam to ham received; P(Message|Class) is the likelihood
of observing Message with the label Class; and P(Message) is a normalising factor.
The expression P(Message|Class) involves computing one probability for each word
contained in a message, each possible word in the vocabulary and each possible class.
For a message containing 500 words, using a vocabulary of 100,000 distinct words
and with 2 possible classes, this would involve the calculation of 100 million terms
(Mitchell, 1997). This would be impractical, if not impossible, to compute in
reasonable time. We therefore assume that any word vi occurring in a document of
class C is independent of any other word, and that all words are identically distributed
throughout the document. With this assumption, we need only calculate 2 × 100,000 =
200,000 terms. This simplification is called the naïve Bayes assumption. Using the
naïve Bayes assumption, (1) simplifies to:
$$ P(Class \mid Message) = \frac{P(Class) \times \prod_{i=1}^{k} P(v_i \mid Class)}{P(Message)} \qquad (2) $$
When using (2) to classify messages we can ignore the invariant normalising factor:
$$ Classification = \arg\max_{c \in \{spam,\, ham\}} P(C = c) \times \prod_{i=1}^{k} P(v_i \mid c) \qquad (3) $$
Since spam filtering is a binary classification problem we can use (3) to calculate a
ratio:
$$ \frac{P(C = spam \mid Message)}{P(C = ham \mid Message)} = \frac{P(C = spam) \times \prod_{i=1}^{k} P(v_i \mid spam)}{P(C = ham) \times \prod_{i=1}^{k} P(v_i \mid ham)} \qquad (4) $$
Equation (4) is normally refactored for simplicity and altered to use a sum of logarithms
in order to avoid the arithmetic underflow which is inevitable when taking the product of
an indefinite number of normalised probabilities, giving:
$$ spam\text{-}ratio = \ln \frac{P(spam)}{P(ham)} + \sum_{i=1}^{k} \ln \frac{P(v_i \mid spam)}{P(v_i \mid ham)} \qquad (5) $$
The resulting expression labels a message as spam if its spam-ratio is greater than a
constant λ, set either to a default value, e.g. 1.0, or calculated from an optimal-classifier
search on a ROC graph as shown by (Lachiche and Flach, 2003) and
implemented in section 3.4.
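To make the calculation concrete, a minimal Java sketch of scoring a tokenised message with equation (5) is given below. The TokenStats class, the Laplace-style smoothing and the method names are illustrative assumptions, not the project's actual implementation:

    import java.util.List;
    import java.util.Map;

    /** Sketch: computing the log spam-ratio of a tokenised message, following equation (5). */
    class NaiveBayesScorerSketch {
        /** Per-token counts; a real filter keeps one such object per feature in its mapping. */
        static class TokenStats {
            int spamCount;   // occurrences of the token in spam training text
            int hamCount;    // occurrences of the token in ham training text
        }

        static double spamRatio(List<String> tokens, Map<String, TokenStats> mapping,
                                int spamMessages, int hamMessages) {
            // ln P(spam)/P(ham): the prior term estimated from the training-set class ratio
            double ratio = Math.log((double) spamMessages / hamMessages);
            for (String token : tokens) {
                TokenStats stats = mapping.getOrDefault(token, new TokenStats());
                // one simple (Laplace-smoothed) estimate of P(v_i|class); other estimates are possible
                double pSpam = (stats.spamCount + 1.0) / (spamMessages + 2.0);
                double pHam  = (stats.hamCount + 1.0) / (hamMessages + 2.0);
                ratio += Math.log(pSpam / pHam);   // sum of per-token log likelihood ratios
            }
            return ratio;   // label as spam if the ratio exceeds the chosen threshold
        }
    }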
The naïve Bayes assumption is not justifiable in natural text which follows rules of
grammar and content. However, naïve Bayes classifiers have time after time proved to
classify with great accuracy despite this erroneous assumption (Sahami et al., 1998,
Androutsopoulos et al., 2000a, Graham, 2002, Domingos and Pazzani, 1997). Efforts
have been made to minimise the error of the independence assumption; (Rennie et al.,
2001) describes techniques to reduce data bias and weight magnitude errors.
Many modifications can be made to a basic naïve Bayes spam filter and several
different ways of estimating P(vi|class) have been applied successfully.
In spam filtering, one of the most critical choices seems to lie in the pre-processing.
The spam filtering community is not unified on how to select the best
features from a message. Some prefer to calculate the information gain of an
attribute and choose a subset which has the greatest ability to discriminate between
classes (Sahami et al., 1998). Paul Graham prefers to include the 20 most extreme
probabilities in his calculation (Graham, 2003), while Bill Yerazunis attempts to
minimise the error caused by the naïve Bayes assumption by using token chains,
in which words have dependencies between them, as features (Yerazunis, 2004).
Different practical aspects of spam filtering are explored more thoroughly in section
4.4.
4.4 Different Algorithms
A large number of methods have been derived from the basic idea of filtering text or
email using naïve Bayes probability estimation. One might argue that some solutions
are more correct than others, but since so many different solutions have been found to
perform excellent spam filtering, it does not seem necessary to attempt to produce a
model answer.
For this dissertation, five different models have been implemented. The models differ
in how they count features, how they estimate probabilities and in how features are
defined. This section and the next use the tools described in chapter 3 to analyse
and compare the performance of the different algorithms.
A large number of spam filters are available, both commercially and as open source
projects. Around 40 are listed on http://paulgraham.com/filters.html.
The different filters vary much in implementation. Spam filtering is a computationally
expensive task and most filters are implemented in C or C++, even though some
implementations are available in Python, Perl or Java. Some filters treat messages as
vectors of feature counts for well-known features (Sahami et al., 1998), while others
use the entire message when classifying (Yerazunis, 2004). Some implementations put
tokens into hash tables according to their origin, while others define their own data
structures. There are different opinions on how to tokenise messages and how to
treat HTML elements. Graham (2003) treats the same token differently according to
where in the message it is encountered, in order to produce a larger number of
features. The commercial PureMessage renders embedded HTML in order to analyse
what the receiver will see when the HTML is displayed (Sophos, 2003).
Due to the variations in implementations of spam filters, it is difficult to compare
different algorithms with respect to performance and computational cost. One of the
secondary aims of this project is to perform a fair comparison of filters using single
words as features and filters using word chains (n-grams) as features. The latter will
be referred to as Markovian spam filters. This difference is explored in the next
section.
All implementations in this project are written in the Java programming language.
The data structure containing the mapping of features is an ad hoc structure where
each distinct feature is represented by an object containing statistics for the
occurrences of the feature. The objects are stored in a binary tree implemented
through the java.util.TreeMap data structure, which offers O(log n) access time to the
objects.
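A simplified sketch of such a feature mapping is shown below; the class and field names are illustrative, and the real objects in the project may carry additional statistics:

    import java.util.TreeMap;

    /** Sketch: feature statistics kept in a java.util.TreeMap, giving O(log n) lookup by token. */
    class FeatureMappingSketch {
        /** Statistics object stored once per distinct feature. */
        static class FeatureStats {
            int spamCount;   // occurrences of the feature in spam training text
            int hamCount;    // occurrences of the feature in ham training text
        }

        private final TreeMap<String, FeatureStats> mapping = new TreeMap<>();

        /** Record one occurrence of a feature originating from a spam or ham message. */
        void train(String feature, boolean fromSpam) {
            FeatureStats stats = mapping.computeIfAbsent(feature, f -> new FeatureStats());
            if (fromSpam) stats.spamCount++; else stats.hamCount++;
        }

        FeatureStats lookup(String feature) {
            return mapping.get(feature);   // null if the feature has never been seen in training
        }
    }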
The five implementations are:
• BayesianSpamFilter. An implementation of what is commonly perceived as
the standard way of filtering spam.
• MitchellMarkov. Implements the counting and calculations in (Mitchell, 1997)
adapted to spam filtering.
• PaulGraham. Implements "A Plan For Spam" by Paul Graham (2002). Graham
calculates P(C|vi) rather than P(vi|C) as local probabilities.
• IG. Classifies on a reduced number of features by choosing only high-information
features, as suggested by (Sahami et al., 1998).
• MarkovianSpamFilter. Uses token chains rather than single words as features (a
sketch of token-chain extraction is given below).
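For the MarkovianSpamFilter, the token-chain features could be produced roughly as in the following sketch of sliding-window n-gram extraction. This is an illustration only, not the project's exact tokeniser; n = 4 corresponds to the 4-gram features used later in this chapter:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    /** Sketch: derive token-chain (n-gram) features from a tokenised message. */
    class TokenChainSketch {
        static List<String> tokenChains(List<String> tokens, int n) {
            List<String> features = new ArrayList<>();
            for (int i = 0; i + n <= tokens.size(); i++) {
                // join n consecutive tokens into one feature, e.g. "click here to unsubscribe"
                features.add(String.join(" ", tokens.subList(i, i + n)));
            }
            return features;
        }

        public static void main(String[] args) {
            List<String> tokens = Arrays.asList("click", "here", "to", "unsubscribe", "now");
            System.out.println(tokenChains(tokens, 4));  // two 4-gram features from five tokens
        }
    }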
The filters use a similar algorithm to perform learning and classification; see Figure 9.
Figure 9, training and classifying in spam filtering (during training, the tokeniser feeds token counts into the mapping; during classification, token probabilities are looked up in the mapping and combined into a spaminess score)
When comparing the performance of the different spam filters it is useful to compare
their optimal classifiers in order to see if one filter is more or less conservative than
another. When performing cross validated experiments we need some way to average
the optimal classifiers produced in the different runs. The coordinates of the averaged
optimal classifiers are calculated from:
$$ \bar{X} = \frac{\sum_{i=1}^{numFolds} X_i}{numFolds}, \qquad \bar{Y} = \frac{\sum_{i=1}^{numFolds} Y_i}{numFolds} $$
Two experiments were performed. First, the filters were tested using cross validation
on a data set consisting of a training corpus containing 400 spam messages and 400
ham messages and a test corpus containing 450 spam messages and 350 ham messages.
The cross validation used is not a typical cross validation where the data is partitioned
and some partition(s) are withheld as the test set while the others are used as the training set.
Instead, a number of messages are randomly selected as participants from a larger pool.
The cross validation algorithm used for the experiments in this dissertation is
described in section 5.3. The filters performed 30 runs where the training set and the test
set each consisted of 400 messages picked from their respective corpora. The
averaged results are shown in Table 1. The field "Optimal classifier" shows the FP and
FN errors of the classifier chosen by sliding an iso-accuracy line with slope 0.15
(6.67 in an X/Y grid) from ROC heaven towards the ROC curve (figure 5b). This
represents a cost ratio where an FP is regarded as 6.67 times more expensive than an FN.
Implementation         AUC, fixed data    Optimal classifier (FP/FN)
BayesianSpamFilter     0.9921             3.3 / 12
MitchellMarkov         0.9897             4.6 / 9.3
PaulGraham             0.9745             6.5 / 6.2
IG                     0.9864             4.8 / 17.7
MarkovianSpamFilter    0.9954             2 / 13

Table 1, cross validated performances
From Table 1 it is clear that there are differences in how the spam filters perform. To
check whether the differences were statistically significant, the AUCs were compared
using a paired-sample T-test in SPSS 10.1 (www.spss.com).
The results shown in figure 10 demonstrate that the different filters perform
significantly differently. However, there is little difference between BayesianSpamFilter
and MitchellMarkov, and between MitchellMarkov and IG.
Paired Samples Test (paired differences; 95% confidence interval of the difference; df = 29 for all pairs)

Pair                       Mean        Std. Dev.   Std. Error   CI Lower    CI Upper    t       Sig. (2-tailed)
1  BAYESIAN - MITCHELL     0.002446    0.0052879   0.0009654    0.000471    0.004420    2.533   0.017
2  BAYESIAN - GRAHAM       0.017655    0.0150594   0.0027495    0.012031    0.023278    6.421   0.000
3  BAYESIAN - IG           0.005719    0.0066905   0.0012215    0.003220    0.008217    4.682   0.000
4  BAYESIAN - MARKOV      -0.00331     0.0043069   0.0007863   -0.00492    -0.00171    -4.2     0.000
5  MITCHELL - GRAHAM       0.015209    0.0155019   0.0028303    0.009420    0.020997    5.374   0.000
6  MITCHELL - IG           0.003273    0.0071545   0.0013062    0.000601    0.005944    2.506   0.018
7  MITCHELL - MARKOV      -0.00576     0.0046868   0.0008557   -0.00751    -0.00401    -6.7     0.000
8  GRAHAM - IG            -0.01194     0.0153817   0.0028083   -0.01768    -0.00619    -4.3     0.000
9  GRAHAM - MARKOV        -0.02097     0.0152717   0.0027882   -0.02667    -0.01527    -7.5     0.000
10 IG - MARKOV            -0.00903     0.0068122   0.0012437   -0.01158    -0.00649    -7.3     0.000

Figure 10, Paired-sample T-test
The second experiment looks in more detail at where the algorithms differ. The five
filters were set to process the data set of 1600 messages without cross validation.
When the filters were allowed an initial training phase, the training set consisted of
500 messages, 250 of each class, and the test set contained 500 ham messages and 600
spam messages. When the filters were allowed to learn from their mistakes, the
whole data set was used. The MarkovianSpamFilter used 4-grams as features. The
resulting performances are shown in Table 2.
Implementation        AUC, fixed data   Optimal classifier   AUC, Train On Errors   Optimal classifier
BayesianSpamFilter    0.9895            17/43                0.99186                5/19
MitchellMarkov        0.9894            17/44                0.99151                6/16
PaulGraham            0.9878            10/27                0.9303                 42/55
IG                    0.9881            12/84                n/a                    n/a
MarkovianSpamFilter   0.9982            3/15                 0.9997                 0/4
Table 2, spam filter performances
From table 2 we see that each filter has its own characteristics and behaves differently in the two types of training. Of the four filters using simple tokens as features (uni-grams), the BayesianSpamFilter has the highest AUC in both types of training (not cross validated). Looking at the first optimal classifier, the PaulGraham filter produces a very good result with only 10 FP and 27 FN. This is clearly a much better result than the 17/43 of the BayesianSpamFilter when considering a cost ratio of 6.67, but the latter nevertheless has a higher AUC. To find out why the two performance indicators do not suggest the same ordering of the two classifiers we must study their ROC curves, figure 11.
Figure 11a BayesianSpamFilter
Figure 11b PaulGraham
By studying the difference between figures 11a and 11b we notice that the BayesianSpamFilter has very many points in the 'knee' of the graph (the section between FP=0 and TP=600) while the PaulGraham filter has far fewer points in the same area. This difference stems from how the probabilities are calculated in the two algorithms. The PaulGraham implementation in this project estimates a normalised probability that a message is spam based on Graham's interpretation of Bayes' theorem, without considering the prior probabilities of the classes:
P(spam \mid message) = \frac{\prod_{i=0}^{20} P(v_i \mid spam)}{\prod_{i=0}^{20} P(v_i \mid spam) + \prod_{i=0}^{20} \left(1 - P(v_i \mid spam)\right)} \qquad (6)
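For illustration, the sketch below computes the combined probability of equation (6) from a set of per-token probabilities. The class and method names are illustrative; the code is a sketch of the combination rule, not the project's PaulGraham implementation.

// Minimal sketch of the probability combination in equation (6).
public class GrahamCombiner {

    // Combines per-token spam probabilities P(v_i | spam) into a single
    // message score, ignoring the class priors as in equation (6).
    static double combinedSpamProbability(double[] tokenSpamProbs) {
        double productSpam = 1.0;
        double productHam = 1.0;
        for (double p : tokenSpamProbs) {
            productSpam *= p;        // P(v_i | spam)
            productHam *= (1.0 - p); // 1 - P(v_i | spam)
        }
        return productSpam / (productSpam + productHam);
    }

    public static void main(String[] args) {
        // Three tokens with mildly spammy probabilities already push the
        // combined score close to 1, which is why PaulGraham scores tend
        // to cluster near 0 or 1.
        double[] probs = {0.9, 0.8, 0.7};
        System.out.println(combinedSpamProbability(probs)); // ~0.988
    }
}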
The BayesianSpamFilter assigns an un-normalised score according to the spaminess/haminess ratio observed. The distribution of the scores in the two classifiers is shown in figure 12.
[Chart "Probability Distribution": spaminess (Y-axis) per message (X-axis, 0-1100) for BayesianSpamFilter and PaulGraham.]
Figure 12, probability distribution
Figure 12 clearly shows how the two filters differ in how they distribute probabilities. The BayesianSpamFilter has a smooth transition from evidence of spam to evidence of ham, while the graph of PaulGraham is almost a step function, leaving few points on the steep transition from 0 to 1. Due to this property, the PaulGraham filter is unable to draw a curve with a smooth 'knee' like the other filters in the project and therefore tends to produce ROC curves with a low AUC. As a consequence of the shape of its ROC curve, a PaulGraham filter can produce excellent results for some cost ratios but poor performance under other conditions. This shortcoming can be amended by using other methods for combining probabilities, such as chi-square (χ²) (Louis, 2003a).
The above analysis of two classifiers serves as a good example of how the AUC can be used as a measurement of classifier robustness – its ability to perform well under different conditions. After the ROC analysis it was possible to conclude that even though the PaulGraham filter outperforms the BayesianSpamFilter for one particular cost ratio, the BayesianSpamFilter is a more robust classifier and should be preferred in real life.
This way of choosing a spam filter by ROC analysis is not common and has not been observed in any spam literature by the author. It is hoped that the theory and tools presented in this dissertation can motivate other people to apply ROC analysis to the spam filtering problem.
4.5 Bayesian versus Markovian spam filters
It has been set as a secondary objective for this dissertation to compare the
performance of Bayesian spam filters and Markovian spam filters. The motivation for
this comparison is the contradicting evidence available to whether the Markovian
filter is superior to the Bayesian filter. Comparisons published do not offer an answer
to how a basic version of the two algorithms compare but rather uses benchmark
results from spam system packages publicly available, implemented in different
languages and with individual optimisations.
In this project, MarkovianSpamFilter is an implementation of a naïve Bayesian spam
filter where n-grams are used as features. An instance of MarkovianSpamFilter where
the order of the Markov chain (sentence length) is set to 1 is a standard naïve
Bayesian filter. This property allows for a very fair comparison of the two filtering
algorithms.
Sam Holden (2002a, 2002b) performed an extensive comparison of spam filters including SpamBayes, based on Graham (2002, 2003), and the Markovian filter CRM114 by Yerazunis (2004). Holden found several Bayesian spam filters to outperform CRM114, while the creator of CRM114, Bill Yerazunis, has published results where his filter outperforms Bayesian filters (Yerazunis, 2004).
In their comprehensive study of spam filtering techniques, Androutsopoulos et al. (2004) concluded that the use of n-grams as features rather than uni-grams did not improve the performance of spam filters. Androutsopoulos et al. concluded that the generated n-grams contained redundant information and did not add features of high discriminating power. However, the experiments were performed on a Support Vector Machine based spam filter system called Filtron (Michelakis et al., 2004), and the results do not necessarily apply to Bayesian filters where all features are used for classification.
The motivation to develop Markovian spam filters is based on several observations made from studying spam and from the assumptions necessary to create a naïve Bayes classifier. As discussed in section 4.3, the naïve Bayes text classifier assumes that all words in a text are independent of each other. This assumption motivates spammers to attempt to fool Bayesian spam filters by disguising some words by splitting them into fragments. Bayesian filters are also sensitive to 'dictionary salads', where spammers add a large number of innocent or random words to the message in order to push the total probability of the message towards legitimate.
The independence assumption of naïve Bayes can be compared to the perceptron (Minsky and Papert, 1969) in that it can only solve problems that are linearly separable: it can learn to assign high spam probability to the features A and B, but it cannot learn to allow the feature AB. Example: a doctor wishes to receive email with information on a study undertaken on sexual education and contraception among young people. An article might be called, or contain the phrase, "The naked truth about teenage pregnancies". The article would contain many phrases used in healthcare. These phrases tend to share some words which may also occur in more vulgar contexts. The presence of the words 'naked' and 'teenage' is a very strong indication
of spam content of a pornographic nature. The mentioned phrase would probably trigger a Bayesian spam filter to block the message.
To overcome some of these weaknesses spam filters can use n-grams as features, where n is a number typically ≤ 5. An n-gram is treated as a Markov chain, hence the name Markovian spam filter. A Markov chain is defined by: if, in a sequence of n trials X1, X2, …, Xn, the outcome of any trial Xk (1 < k ≤ n) depends only on the outcome of the preceding trial(s), the sequence is said to have the Markov property and is called a Markov chain.
An n-gram is basically a sentence fragment, and it is common to associate some kind of weight with features according to their length, to allow a long sentence to have a greater effect on the outcome than shorter ones. Revisiting the above example, a feature like "the naked truth" would probably have a low spam probability and a greater weight than the single token "naked", thereby allowing the message to pass.
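As an illustration of how such chain features can be derived from a token stream, the sketch below extracts every chain of up to order consecutive tokens from a message. This is a generic sketch of the idea, not the feature construction used by CRM114 or by the MarkovianSpamFilter in this project.

import java.util.ArrayList;
import java.util.List;

// Minimal sketch: build n-gram (token chain) features of length 1..order.
public class NGramFeatures {

    static List<String> extractFeatures(String[] tokens, int order) {
        List<String> features = new ArrayList<>();
        for (int start = 0; start < tokens.length; start++) {
            StringBuilder chain = new StringBuilder();
            for (int len = 1; len <= order && start + len <= tokens.length; len++) {
                if (len > 1) chain.append(' ');
                chain.append(tokens[start + len - 1]);
                features.add(chain.toString()); // uni-gram, bi-gram, ... up to order
            }
        }
        return features;
    }

    public static void main(String[] args) {
        String[] tokens = {"the", "naked", "truth"};
        // Produces: the, the naked, the naked truth, naked, naked truth, truth
        System.out.println(extractFeatures(tokens, 3));
    }
}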
In the Markovian filter CRM114, the features are given super increasing weights in such a way that the probability from a long sentence is given enough decisive power to outweigh any shorter sentence created by sub-strings, figure 13.
Token chain                                   Weight
The                                           1
The Naked                                     4
The <n/a> Truth <n/a>                         4
Naked <n/a> <n/a> Teenage                     16
The Naked <n/a> <n/a> Teenage Pregnancies     64
The Naked Truth About Teenage Pregnancies     256

Figure 13, super increasing weights: 2^(2·N−1)
The local probabilities when using super increasing weights are calculated from:

P_{spam}(v_i) = 0.5 + \frac{\left(P(v_i \mid ham) - P(v_i \mid spam)\right) \cdot weight}{\left(P(v_i \mid ham) + P(v_i \mid spam) + 1\right) \cdot weight_{max}} \qquad (7)
There is no golden rule for weights in Markovian spam filters. A less complex version
that has been developed for this project is:
P(v_i \mid spam) = P(v_i \mid spam) \times \left(0.5 + \frac{|feature|}{2^{\,order-1}}\right) \qquad (8)
The formula in (8) does not use super increasing weighting but varies the weights between 0.5 and 1.0. Both (7) and (8) have been implemented and tested. For the implementations in this project, (8) has been found to perform better than (7). The reason for this result might be that the mapping and feature construction in CRM114 are different from the ones used in this project. It is not considered relevant for the comparison of Bayesian and Markovian filters to give an elaborate discussion on the effectiveness of different weighting schemes.
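A small sketch of how a length-based weighting of this kind could be applied when scoring a feature is given below. It assumes the reading of (8) reconstructed above, with |feature| taken as the number of tokens in the chain; the actual implementation in the project source may differ.

// Minimal sketch of the length-based feature weighting in equation (8),
// under the assumption that weight = 0.5 + |feature| / 2^(order-1).
public class ChainWeighting {

    static double weightedProbability(double probability, int featureLength, int order) {
        double weight = 0.5 + featureLength / Math.pow(2, order - 1);
        return probability * weight;
    }

    public static void main(String[] args) {
        int order = 4;
        // A single token gets weight 0.625, a full 4-token chain gets weight 1.0.
        System.out.println(weightedProbability(0.2, 1, order)); // 0.125
        System.out.println(weightedProbability(0.2, 4, order)); // 0.2
    }
}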
In Graham (2003), Graham derives many extra features compared to simple tokenising by combining tokens with information about their origin into new features. Graham's motivation for this is to construct a feature space as large as possible to ensure that all information in a message is utilised. This expansion of the feature space is reported to improve performance (Louis, 2003b). To test whether the generation of additional features in the form of n-grams, without any weighting, could increase performance, an experiment was performed where all weights were set to 1.0 (weighting was not used). There was not a significant difference between a weighted and a non-weighted filter (0.9913 ± 0.0066 vs. 0.9908 ± 0.0049, P=0.784), but all tests suggest that the best performance is achieved using weights.
To compare the performance of Bayesian and Markovian spam filters they have been
tested on the same dataset using the same implementation. Three experiments are
shown here:
1. Performance on a small data set.
2. Performance on a medium sized data set.
3. Performance on a large data set.
No messages from the test set are used for training as shown in figure 8.
Experiment 1: small training set.
To compare the performance of the two classifiers when little data is available for
training as is the case in early stages of spam filtering, the filters are compared on a
training set of 20 messages. The training set consists of 10 spam and 10 ham
messages.
On a 30-fold cross validation with 200 random messages in the test sets, the results from a paired sample T-test were: (0.9446 ± 0.040 vs. 0.9459 ± 0.034, P=0.460). The data is noisy, which leads to a high standard deviation for the samples. There is a numerical but not statistically significant difference between the Bayesian and Markovian filters in favour of the Markovian filter.
The reason why the difference is not significant on smaller training sets is noisy data, due to the randomness involved in selecting files to participate in the data sets each run. This noise would have been reduced by using larger data sets (see experiments 2 & 3).
To examine the characteristics of the two classifiers their ROC graphs are plotted
from a single typical run with a test set of 370 spam messages and 240 ham messages,
figure 14. Please note that the Markovian filter in figure 14b is an order-4 filter. This
is to best demonstrate potential differences between the two algorithms.
Figure 14a BayesianSpamFilter
Figure 14b Markov-4
The ROC curves of the two classifiers show different trends. The graph in figure 14b has a larger AUC than the graph in figure 14a. The larger area is likely to be a result of the Markovian-4 filter's ability to make fewer FP errors than the Bayesian filter. The curve in figure 14b has fewer FP errors from about TP=270. However, the Bayesian filter reaches FP=0 at TP=115 while the Markovian filter gets stuck at FP=1 from TP=153 down to TP=72 before reaching FP=0 (not shown in the figures).
In this experiment the Markovian filter seems to outperform the Bayesian filter.
Experiment 2: medium training set.
To compare the performance of the two classifiers when some data is available for
training as in intermediate stages of spam filtering, the filters are compared on a
training set of 200 messages. The training set consists of 100 spam and 100 ham
messages.
On a 30-fold cross validation with 200 random messages in the test sets, the results from a paired sample T-test were: (0.9830 ± 0.105 vs. 0.9887 ± 0.068, P=0.0). There is a clear statistically significant difference between the Bayesian and Markovian-3 filters in favour of the Markovian filter. In this experiment the samples have a lower standard deviation than in experiment 1 due to the larger training set, hence less randomness.
To examine the characteristics of the two algorithms, four ROC graphs are plotted
from a single typical run with a test set of 370 spam messages and 240 ham messages,
figure 15.
Figure 15a BayesianSpamFilter
Figure 15b Markov-2
Figure 15c Markov-3
Figure 15d Markov-4
The graphs in figure 15 suggest that there is a transition towards better performance when using longer chains as features. The AUC is higher for Markovian filters but, more interestingly, the FP error rate is decreased. The trends from the four graphs are summarised in table 3. The column False Positives refers to the intersection between the ROC curve and the X-axis (at True Positives = 320).
Classifier    AUC      False Positives
Bayesian      0.9863   7
Markovian-2   0.9887   4
Markovian-3   0.9909   3
Markovian-4   0.9916   2
Table 3, Markov length and performance
Table 3 shows a trend where classifiers with longer chains have greater AUC and make fewer false positives.
To confirm the trend suggested by the data in table 3, cross validated runs were performed to compare Bayesian (Markovian-1) with Markovian-2 and Markovian-2 with Markovian-3. The runs were performed using a training set of 100 + 100 messages and a test set of 200 in a 30 fold cross validation.
The result of a paired samples T-test for Markovian-1 vs. Markovian-2 was (0.9838 ± 0.01 vs. 0.9862 ± 0.009, P=0.0). The result for a similar test on Markovian-2 vs. Markovian-3 was (0.9866 ± 0.008 vs. 0.9888 ± 0.006, P=0.0). Both tests show that there is a significant improvement between filters using different chain lengths.
Figures 15b, c and d show how the FP error is decreased as longer chains are used as features. This property is particularly desirable in spam filtering as FP errors tend to have a much higher cost than FN errors (see section 3.1).
In this experiment the Markovian filter outperforms the Bayesian filter.
Experiment 3: large training set.
To compare the performance of the two classifiers when much data is available for training, as in the normal conditions of spam filtering, the filters are compared on a training set of 1000 messages. In this experiment, there is not enough data to perform cross validation as described for the previous experiments. Instead, the entire corpus is used as a pool from which 1000 messages are randomly picked for training and 200 messages for a test set. The performances are recorded in a 30 fold cross validated run. The results are tested for significance using a paired samples T-test with the result (0.9989 ± 0.0018 vs. 0.9995 ± 0.0009, P=0.034). The difference is statistically significant on a 95% confidence interval. During the 30 folds, the Bayesian filter only had better results than the Markovian filter on two occasions. The reason why the difference was not significant using a 99% confidence interval is that the margins between optimal and actual performance are so small; during the cross validated run, the Markovian filter achieved an AUC of 1.0 on 16 of 30 folds.
To examine the characteristics of the two algorithms, two ROC graphs are plotted
from a single typical run with a training set of 500 spam messages and 500 ham
messages and a test set of 370 spam messages and 240 ham messages, figure 16.
Figure 16a BayesianSpamFilter
Figure 16b Markov-3
When the spam filters have been allowed to train on 1000 messages the performance is quite good, especially for the Markovian-3 filter, figure 16b. Even if the user were to require that no false positives should occur, the classifier in figure 16b would have an accuracy of (240+356)/(240+370) = 0.977. To select such a threshold, the getOptimalClassifier(slope, direction) function described in section 3.4 could be called with the arguments (0, 'L'). With the mentioned optimal classifier, all 240 legitimate messages would get through and 356 of the 370 spam messages would have been blocked. In most practical cases, the number of spam messages in the inbox can be drastically reduced by allowing a low number of false positives, though this may not be an option for all users.
The conclusion of this experiment is that the Markovian filter outperforms the Bayesian filter for the given settings.
The experiments performed, and in particular experiment 2, show how the performance of a spam filter is increased by using longer Markov chains. However, the increased performance comes at a cost. For the implementation in this project, the run time of a spam filter is doubled by increasing the Markov-chain length by one token.
The computational costs and performances for Markovian spam filters are shown in table 4. The measurements are taken when learning a training set of 500 messages and subsequently classifying a test set containing 1100 messages.
Markov Order   Vocabulary size   Time used (seconds)   AUC
1              25267             2.4                   0.9893
2              82269             5.5                   0.9911
3              216230            11.5                  0.9923
4              509173            24.0                  0.9933
5              1135530           49.0                  0.9945
Table 4, costs versus performance
The size of the training set linearly affects the run time of both Bayesian and Markovian spam filters.
Based on the findings in this section, it is desirable to use Markovian filters of as high an order as possible. However, due to the computational costs, it is not practical to use Markov chains longer than 5, as noted by Yerazunis (2004). In their work, Androutsopoulos et al. (2004) conclude that Markov chains longer than 3 are impractical. Through the experiments performed in this section it seems that 3 or 4 is a reasonable number given the current implementation. The amount of mail and the available hardware, in particular RAM, also play a key role in choosing the best performance/cost ratio.
The results from the performed experiments show that Markovian spam filters have better performance than Bayesian spam filters when the two are compared in an unbiased test. In this context, better performance is defined as having a higher AUC and making fewer mistakes, especially false positives.
These findings are more significant than those reported for CRM114 by Yerazunis (2004), who stated: "A markovian filter makes fewer errors than a Bayesian filter by about the same margin as a light beer has fewer calories than a regular beer".
This chapter has fulfilled the secondary aim of comparing Bayesian and Markovian filters. At the same time, the ROC curve theory and tools presented in chapter 3 have been put to the test and proved extremely useful in analysing the differences and similarities between different classifiers.
5 Using artificially generated text to improve spam
filtering
This chapter will discuss various techniques for using artificially generated text to train spam filters. Section 5.3 contains experiments and results related to the primary aim of this dissertation. Section 5.4 takes the text generation one step further and looks at ways to generate text that is hard to classify, which is a secondary aim.
5.1 Basic idea of Interpolating
The main investigation of this project is to measure the effect of a new training scheme, or more specifically: what happens to the performance of a spam filter when artificially generated text is used to enlarge the amount of available training data.
One of the motivating factors for investigating the effects of using artificially generated text as training data stems from the observation that there is a strong correlation between the size of the training set and the performance of a classifier. Figure 17 shows how a spam filter's AUC grows as the size of the training set increases. Note that the graph in figure 17 is not linear as the X-axis has uneven intervals.
[Chart "The effect of training": AUC (Y-axis, 0.9-1.0) against size of training set (X-axis: 10, 20, 50, 200, 1000 messages).]
Figure 17, performance and size of training set
There are several approaches to text generation, e.g.:
• Cut-and-paste text from the internet/news groups
• Use Natural Language Generation from a dictionary
• Generate text by extracting tokens randomly from an email dataset
• Generate text similar to an email corpus using a Markov Chain Text Generator
• Generate text using a GP with a set of grammar rules.
Intuitively one would require the generated text to be 'valid' text; readable to the
human eye. The justification for this assumption is that the artificially generated text
will serve as a supplement to the received email and should resemble it in content and
style. However, this assumption might be biased by the fact that the human brain does
not easily visualise the complex probability mapping of a naïve Bayes text classifier.
As long as a Bayesian spam filter uses atomic tokens as features, the ordering of the
words in the text is irrelevant. This simplification is a component of the naïve Bayes
assumption as explained in section 4.3.
The basic idea of interpolating requires the generated text to resemble the actual email of the individual user in wording and content. The Markov Chain text generator (MCTG) has properties making it well suited to produce text satisfying these criteria, due to its ability to mimic the style of the supplied data.
As the name suggests, the MCTG assumes that text can have the Markov property and treats tokens in text as a Markov chain, or more formally a finite-state Markov chain with stationary transition probabilities. The definition of Markov chains is given in section 4.4.
A MCTG has a learning stage and a generation stage. In the learning stage the MCTG analyses a training text and constructs a Markov model based on observed transitions and counts. For every distinct token occurring at position posn of the corpus, a set of transition probabilities is estimated, conditioned on the tokens occurring at positions posn−d to posn−1, where d is the depth (order-d Markov chain).
More formally:
value(t_n) = \arg\max_{t \in T} \; p(t \mid t_{n-d}, t_{n-d+1}, \ldots, t_{n-1})
In the generation stage the Markov model (table of transitional probabilities) is used
to construct sentences from words. By knowing the d preceding tokens of any new
token t we can estimate the probabilities of different possible values of t. This process
is repeated using t as a preceding word for a new lookup.
Most references to Markov text generation theory credit Shannon (1948) as the inventor of Markov-chain text generation. In his work he described how transition probabilities could be extracted from a section of text and used to generate similar text. Although Shannon could not test his algorithm at a large scale, since this kind of work would be nearly impossible without the use of a computer, his idea is still the foundation for most implementations.
Markov models can be created based on characters, words or sentences. The MCTG produces text with a depth specified by the user. The character-level MCTG produces interesting results but will produce words which do not occur in the presented text. When using word-based Markov models the generated text mimics the presented text and does not introduce new tokens to the spam filter.
Some examples of generated text are given below:
Order-1 word-chain: On the most n floating point for instance). Bitmap Data
Structure. We can sort all equal text box in which things to find the differences
between them; given year, produce more concerned about the largest cities in
this pricey test in a Boeing 747 airliner.
Order-3 word-chain: Initialize the cumulative array and Algorithm 3 uses a
simple form of divide-and-conquer; textbooks on algorithm design describe
more advanced forms. Problems on arrays can often be solved by searching for
each array element in order: first x[0], then x[1], and so forth.
We see that the text produced gradually becomes more sensible as the depth increases. With a word-chain of depth one or two (Order-1 and Order-2 respectively) the text contains many grammatical errors and does not make much sense. Order-3 text looks like normal text but by reading it we realise that it does not make sense. Order-4 text is nearly perfectly readable.
When using character-based Markov models the generated text mimics the character transitions in the learning text. Due to randomness in the selection, new tokens are introduced into the training data.
Some examples of character-level text are given below:
Order-1 character-chain: nome atwoce te smstarond toce acthare est juthel vers
Th ay theome aytinglyt f vinca tiforachthedabj) thionfo rpobaro ske g,
beruthainsse iedif ton eane ustioutinde. titesy s th g ronpromarace s,
Wedesimess ctis wan spe thivausprillyton e bougre inorunds RACPore
Order-3 character-chain: Thomogets, we difficients then space the in run of the
square mats in easure dointerated that peral so repreter Read the ROM oring
tencodescribut with arrays of throughly coding Spots: a 2080; requires the
number ints load to red game-playing entall column
5.2 Tools used
Two text generators have been implemented, one for character-level and one for word-level text. It is possible to generate both types of text using the same implementation/algorithm, but with no control of the starting point of the text.
The character-level text generator implemented in CharacterLevel_MailGenerator.java uses a very simple algorithm, well explained in chapter 15 of Bentley (2000). The program is implemented in less than 150 lines of code. The technique used is to track any character suffix following a character-string prefix. To keep the programming part simple, a suffix is represented by a StringBuffer which is concatenated with any suffix character encountered for a prefix during learning.
The algorithm for learning a character-level Markov model from text is shown below.

1: Set Mapping = a map from prefixes to suffixes
2: Set prefix = the n first elements of the text file
3: while(next element, suffix, is not EOF)
4:   if(prefix previously unseen)
5:     put prefix and a suffixList in the mapping
6:   else
7:     add suffix to suffixList
8:   prefix = prefix – prefixElementAt[0] + suffix
When generating text, a starting point is set and the following character is randomly chosen from the suffix string. To generate credible email, the starting point of text generation is always set to the n first characters of the word 'From ' where n is the order of the Markov chain. The text generator is trained by a call to the function learnMarkovModel(path of corpus). New text is generated by calling generateText(destination path, length of text file).
The algorithm for generating character-level Markov text is shown below.

1: Set prefix = the first n elements to start a new text file
2: Write prefix to file
3: Set terminationCriteria = a stopping point for generation
4: while(terminationCriteria not fulfilled)
5:   suffix = suffixList[random number ∈ (0, |suffixList|)]
6:   Write suffix to file
7:   prefix = prefix – prefixElementAt[0] + suffix
The word-level text generator implemented in WordLevel_MailGenerator.java is much more complex than the character-level implementation and uses over 300 lines of code. The implementation is based on the implementation of Markovian text generators in (Kernighan and Pike, 1999).
One reason for the more complex program is that words need to be read in the order they are presented. To do this, a text document is buffered and an ArrayList of n words is used as a prefix. The collection of words (suffixes) following a prefix is stored in an ArrayList mapped to each prefix.
Another reason is that, due to the desire to produce text that is very similar to real email, the generated text should maintain the semi-random structure of email messages, which start with a head and end with a body. To achieve this goal, a list of starting prefixes is gathered during learning and a randomly chosen element from this list is used to start new text documents.
The use of the word-level text generator is similar to that of the character-level text generator.
The algorithm for learning a word-level Markov model from text is:

1:  Read training text into memory
2:  Set Mapping = a map from prefixes to suffixes
3:  Set prefix = the n first elements of the text file
4:  Set firstPrefixes = a list of elements starting messages
5:  Add first element to firstPrefixes
6:  while(next element, suffix, is not EOF)
7:    if(prefix previously unseen)
8:      put prefix and a suffixList in the mapping
9:    else
10:     add suffix to suffixList
11:   prefix = prefix – prefixElementAt[0] + suffix
The algorithm for generating word-level Markov text is:

1: Set prefix = a random element from firstPrefixes to start a new text file
2: Write prefix to file
3: Set terminationCriteria = a stopping point for generation
4: while(terminationCriteria not fulfilled)
5:   suffix = suffixList[random number ∈ (0, |suffixList|)]
6:   Write suffix to file
7:   prefix = prefix – prefixElementAt[0] + suffix
The termination criterion is fulfilled if the chosen suffix is "From" or "\n".
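To make the pseudocode above concrete, the sketch below shows a compact order-1 word-level generator along the same lines. It is a simplified illustration (no 'From ' prefix handling, no per-class models and no file I/O) and is not the WordLevel_MailGenerator implementation itself.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Minimal sketch of an order-1 word-level Markov chain text generator.
public class MarkovTextSketch {

    // Maps each prefix word to the list of words observed to follow it.
    private final Map<String, List<String>> suffixes = new HashMap<>();
    private final Random random = new Random();
    private String firstWord;

    // Learning stage: record every observed word-to-word transition.
    void learn(String trainingText) {
        String[] words = trainingText.split("\\s+");
        firstWord = words[0];
        for (int i = 0; i + 1 < words.length; i++) {
            suffixes.computeIfAbsent(words[i], k -> new ArrayList<>()).add(words[i + 1]);
        }
    }

    // Generation stage: repeatedly pick a random recorded suffix of the current word.
    String generate(int maxWords) {
        StringBuilder text = new StringBuilder(firstWord);
        String prefix = firstWord;
        for (int i = 1; i < maxWords; i++) {
            List<String> candidates = suffixes.get(prefix);
            if (candidates == null || candidates.isEmpty()) break; // a dead end terminates the text
            String next = candidates.get(random.nextInt(candidates.size()));
            text.append(' ').append(next);
            prefix = next;
        }
        return text.toString();
    }

    public static void main(String[] args) {
        MarkovTextSketch generator = new MarkovTextSketch();
        generator.learn("free offer free money free offer now money now");
        System.out.println(generator.generate(20));
    }
}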
5.3 Interpolating training data with artificially generated text.
Based on the observations in figure 17 it can be hypothesised that larger training sets in general enable classifiers to achieve better performance. This observation is supported by theory (Mitchell, 1997). To test whether this hypothesis could hold when the training set consists partially of artificially generated text, a number of experiments have been carried out.
The experiments measure the difference between a classifier using a normal training set and a classifier using a training set to which artificial text has been added. Most experiments follow this algorithm for cross validated results:
1:  Set testCorpus = all files in test set
2:  Set trainingCorpusNormal = all files in training set
3:  Set generator = new WordLevel_MailGenerator
4:  Double the size of the training corpus by generating artificial text
5:  Set trainingCorpusAug = all files in training set
6:  for(number of cross validations)
7:    testSet = n random files from testCorpus
8:    trainingSetNormal = m random files from trainingCorpusNormal
9:    Return AUCnormal from classification
10:   trainingSetAug = m random files from trainingCorpusAug
11:   Return AUCaug from classification
The first experiment performed investigates whether the introduction of artificially generated text has any effect on the performance of a Bayesian classifier.
The above algorithm is applied to a dataset consisting of a training set containing 800 messages, 400 of each class, and a test set of 350 ham and 450 spam messages. In line 4 of the algorithm, 800 messages are added to the training set so that it contains 1600 messages. The generated text consisted of 2x400 messages generated using an order-3 Markov model calculated from the messages of each class in the original training set. The experiment was performed using a 30 fold cross validation with 10, 25, 50, 100 and 200 as values for m. The averaged results are shown in table 5.
Size of training set   AUC normal   SD normal   AUC augm   SD augm   Trend
10                     0.911        0.0458      0.911      0.0430    -0.00037
25                     0.959        0.0213      0.958      0.0231    -0.00076
50                     0.969        0.0188      0.977      0.0268    0.0083
100                    0.972        0.0150      0.983      0.0141    0.0116
200                    0.985        0.0095      0.989      0.0063    0.0040
Table 5, performance after adding artificially generated text
The cross validated results from the above experiments suggest that there is indeed a positive effect of augmenting the training set with additional artificially generated text. To check that the results are not just a coincidence, the samples are tested for significance in a paired samples T-test using SPSS. The results are shown in figure 18.
Paired Samples Test (paired differences; df = 29 for all pairs)

Pair   Comparison        Mean        Std. Dev.   Std. Error   95% CI Lower   95% CI Upper   t        Sig. (2-tailed)
1      BNO200 - ANO200   -0.003938   0.0088324   0.0016126    -0.007236      -0.000640      -2.442   0.021
2      BNO100 - ANO100   -0.011571   0.0179878   0.0032841    -0.018288      -0.004855      -3.523   0.001
3      BNO50 - ANO50     -0.008269   0.0307790   0.0056194    -0.019762      0.003224       -1.471   0.152
4      BNO25 - ANO25     0.000750    0.0323690   0.0059097    -0.011337      0.012837       0.127    0.900
5      BNO10 - ANO10     0.000370    0.0646046   0.0117951    -0.023754      0.024493       0.031    0.975
Figure 18, output from SPSS
Figure 18 shows that there is a statistically significant difference between the AUC of classifiers before and after interpolating, using a 95% confidence interval, for training sets of 100 and 200 messages. The findings from table 5 and figure 18 suggest a trend where the method improves the performance of a Bayesian spam filter when applied to training sets containing more than 50 messages.
These findings are highly interesting and represent very much the desired result.
Motivated by the findings in table 5, experiments were performed to compare the effect of different text generators. Four text generators were tested: order-3 word level, order-1 word level, order-3 character level and order-4 character level. 800 messages were generated and added to a training set of 800 messages as above. The effect was measured on a Bayesian filter over a 100-fold cross validation with a training set of 100 messages and a test set of 200 messages. The resulting performances and optimal classifiers are presented in table 6. The fields FP and TP represent the coordinates of an averaged optimal classifier with a cost ratio of 6.67.
Training set   AUC norm   AUC augm   FP norm   FP augm   TP norm   TP augm
WOrder-3       0.9722     0.9776     3.6       2.8       13.9      12.4
WOrder-1       0.9739     0.9849     3.2       2.3       14.1      8.8
ChOrder-3      0.9734     0.9771     3.0       2.8       15.4      13.3
ChOrder-4      0.9724     0.9818     3.7       2.9       12.9      9.7
Table 6, augmented vs. normal training set on Bayesian filter
Paired samples T-tests for the four results reveal:
WOrder-3  -> 0.9722 ± 0.0163 vs. 0.9776 ± 0.0155, P = 0.003
WOrder-1  -> 0.9739 ± 0.0163 vs. 0.9849 ± 0.0088, P = 0.0
ChOrder-3 -> 0.9734 ± 0.0182 vs. 0.9771 ± 0.0133, P = 0.049
ChOrder-4 -> 0.9724 ± 0.0174 vs. 0.9818 ± 0.0093, P = 0.0
All four methods produce improvements that are statistically significant.
The findings in table 6 show that text generated using both word-based and character-based Markov models will improve spam filtering performance. In addition to a significant increase in AUC, the classifiers also move towards fewer errors. More interestingly, there is a decrease in false positives. This property is highly interesting in spam filtering, especially when the FP/FN cost ratio is high.
There are clear differences between the effects of the four generators. The effect of using order-1 word level text is much greater than the effect of the order-3 word level generator. When planning the experiment it was expected that the text generated by the order-3 generator would be more beneficial than the order-1 text because it would resemble the original text more closely. This assumption was shown to be wrong.
Based on these findings, an experiment was performed where an order-0 word level text generator was implemented and used to augment the training set. The order-0 generator does not construct a Markov model of transitions but simply picks 500 random words from the training text and duplicates them. This is not the same as generating random text, since the distribution of words in the original text is maintained: if the original text has a 10% chance of a word being "Free", then the generated text also has a 10% chance, since the random selection implemented in Java has a uniform probability distribution.
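A sketch of what such an order-0 generator could look like is shown below. It draws words uniformly at random from the pooled training text, which preserves the original word frequencies because frequent words occupy more slots in the pool. The code is an illustration of the idea rather than the RandomWordGenerator.java implementation itself.

import java.util.Random;

// Minimal sketch of an order-0 (frequency-preserving) word generator.
public class OrderZeroGeneratorSketch {

    // Picks numberOfWords words uniformly at random from the training text.
    // Because each occurrence of a word is an equally likely slot, the
    // generated text keeps the word distribution of the original corpus.
    static String generate(String trainingText, int numberOfWords, Random random) {
        String[] pool = trainingText.split("\\s+");
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < numberOfWords; i++) {
            if (i > 0) text.append(' ');
            text.append(pool[random.nextInt(pool.length)]);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String corpus = "free money free offer meeting agenda free";
        // "free" occupies 3 of 7 slots, so it appears with probability 3/7 per draw.
        System.out.println(generate(corpus, 10, new Random()));
    }
}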
The experiment is identical to those in table 6, but this time with the new order-0 text generator implemented in RandomWordGenerator.java. Following the trend from order-3 to order-1 word level text, the order-0 text offers a further increase, boosting the AUC from 0.9755 to 0.9880. This increase is greater than that of any of the generators in table 6. By interpolating using the order-0 word generator, the optimal classifier changed from (3.0, 14.5) to (2.0, 7.0).
The findings so far in this section show that spam filters perform differently when trained on a normal training set compared to an augmented training set. In an attempt to identify the difference between the two classifiers we analyse their ROC curves.
The ROC curves in figures 19a & 19b and 19c & 19d are generated from two iterations in a run as described for table 5, using an order-3 word level text generator.
Figure 19a
Figure 19b
Figure 19c
Figure 19d
In figure 19b, the classifier seems to make fewer FP errors than in 19a; it also has a higher AUC. The differences between figures 19c and 19d are very obvious. In figure 19d, the classifier is able to keep the FP error at zero until more than 100 true positives, while the classifier in figure 19c starts making false positive predictions from only 40 true positives. All graphs in figure 19 are generated using the same size of training sets. Keeping in mind that these figures do not represent cross validated data, they still suggest a general trend where classifiers using training sets augmented with additional data make fewer FP errors.
In the experiments leading to figure 19 the artificially generated data had a balanced class skew and the resulting augmented training set had a 50/50 ratio of ham/spam messages. Despite this, figure 19d shows an extreme skewing compared to figure 19c, considering they both had the same test set. One possible explanation for this observation could be that spam messages tend to be longer than ham messages, but it turns out that the data-size skew was 1.5 before and 1.56 after the additional data was added.
The optimal classifiers in the above experiments are based on an iso-accuracy line with cost ratio 6.67. To check whether this skewed cost might affect the results in any way, a 50 fold cross validated run was made where optimal classifiers were calculated from a balanced cost ratio of 1.0. In this run the AUC went from 0.9720 to 0.9874 and the optimal classifier from (8.1, 2.0) to (4.7, 1.3). The optimal classifier with a cost of 1.0 is quite different from the optimal classifier with a cost of 6.67, but it still demonstrates a tendency of moving towards a more desirable result with fewer errors and, most importantly, fewer false positives.
To investigate how different compositions of the augmented training set affect the result, a run was performed where only 400 spam-based messages were added to the training data. The resulting AUCs were 0.9763 before and 0.9717 after interpolating, a decrease in performance. The optimal classifier (cost ratio = 1.0) changed from (6.8, 2.4) to (7.8, 2.3), producing a poorer FP-rate.
It seems clear that a Bayesian spam filter will benefit from having artificially generated text added to the training set, but only at a balanced class skew.
In an attempt to find any clear trend in the effect of interpolating, experiments were carried out using an order-3 Markovian spam filter. Experiments similar to those described in table 6 were carried out, but using only a 50-fold cross validation due to the large computational cost. The results are shown in table 7.
Training set   AUC norm   AUC augm   FP norm   FP augm   TP norm   TP augm
WOrder-0       0.9799     0.9870     2.5       1.7       10.9      9.9
WOrder-3       0.9776     0.9865     2.6       1.8       12.9      10.7
ChOrder-4      0.9777     0.9812     2.9       2.9       12.9      8.9
Table 7, augmented vs. normal training set on Markovian filter
Table 7 shows that the Markovian spam filter also benefits from having artificially generated text introduced into the training set. Comparing table 6 to table 7 we see that the Markovian filter has a slightly different reaction to the augmented data. The performance increases in table 7 are not as great as those observed in table 6. This is partially due to the fact that performance increases are harder to achieve as the AUC gets closer to 1.0.
When trained on word level data, the performance increase for a Markovian filter is about the same for order-0 and order-3. On the Bayesian filter, these two types of text produced very different results.
When trained on character level data, the Markovian filter shows little improvement in performance compared to that obtained with word level text. The explanation for this might be that while the word level text helps the filter in calculating a word count closer to the actual word distribution in email, the character level text is not capable of producing useful text that can guide the spam filter in finding a target function. The reason why the Bayesian filter has a fair performance increase might be that the character level text generator increases the size of the classifier's vocabulary. Since spam messages often contain tokens which are not valid English (or any other language), a large vocabulary might have a listing for some words seen in the test set. This property is not as useful in a Markovian filter, which uses chains of several combined words (in addition to atomic tokens) and is therefore unlikely to have a match for a random and invalid token. In addition, the vocabulary of a Markovian filter is larger than that of a Bayesian filter (see table 4), and the positive effect of an increased vocabulary size will not be as significant.
When examining the results of the experiments carried out in this section, the most likely explanation for the performance improvement is that the interpolated training set reduces over fitting by smoothing the output function of the filter so that it can more accurately approximate the target function of the email. Judging from the performance improvements in table 6 and table 7 it seems that a typical spam filter is trained to a high level of over fitting. By adding some randomness to the training set the word counts are smoothed into less extreme patterns. This in turn affects the probabilities learned by the spam filter and allows it to learn a less specific function. This property is very interesting in real-life spam filtering due to the fact that the target function is constantly changing and actively evading spam filters (Fawcett, 2003a).
From figure 19 and tables 6 & 7 it seems that fewer false positive errors are made when classifying based on a probability mapping learnt from an augmented training set. This observation is closely related to the observation that the additional data reduces over fitting. An analysis performed on the content of messages misclassified as spam revealed that they contained features that are also seen in spam messages. These features include HTML tags and some MIME attachments. By smoothing the data, the features shared by spam and ham are skewed towards innocent.
Example: the HTML tag "<a href=" is observed in 40 spam messages and 10 ham messages and might have a spaminess of 40/50 = 0.8. By adding 50 new occurrences of the tag through artificial text generation, 30 in spam and 20 in ham, the spaminess is reduced to 70/100 = 0.7. This reduction of 0.1 might not seem very significant, but when calculating combined probabilities over many numbers the result might be drastically changed. The effect of making some of these features less extreme is that ham messages that contain shared features are less affected in the wrong direction. This leads the spam filter to become more conservative and make fewer false positive errors.
5.4 Generating tougher text
Throughout section 5.3 the training set was augmented by adding text generated by a simple and well known algorithm, the Markov chain text generator. The method was shown to improve spam filtering. In this section, we will investigate whether spam filtering can be further improved by adding artificially generated text which has been generated using feedback from a spam filter. Two different approaches are explored for deciding what makes text tough: a spaminess map from a spam filter and an entropy analysis of features.
The idea leading to this approach is that spam filters encounter some features that are easy to assign to a class and some features that are not. We wish to investigate the effects of encouraging the text generator to insert some groups of words into new text. This idea is inspired by, and therefore resembles, the techniques of reinforcement learning and co-evolution. In reinforcement learning, an active agent receives feedback on its performance and is able to adjust its behaviour (policy) according to the feedback in order to perform better (Sutton and Barto, 1998). In co-evolution a system attempts to optimise some target function and is dependent on hard test cases in order to avoid stagnation (premature convergence). The hard test cases are thought of as parasites which actively try to make it hard for the system to solve them (Hillis, 1990).
In an attempt to generate text which is hard for a spam filter to classify, the observations made from figure 12 are exploited. In a PaulGraham filter, the majority of messages are assigned either a very low or a very high spaminess. By encouraging words with a spaminess between the two extremes to appear in the text, the resulting text should produce a less extreme spaminess score. This idea has been implemented in WordLevel_MailGenerator as the function generateThoughText. The function requires an instance of PaulGraham to supply it with a spaminess map. When a simulated message is generated, the function first generates text similar to that in section 5.3 and then adds an equal amount of text at the tail of the message. This additional text consists of features with a spaminess between 0.2 and 0.8. The features are picked from the probability mapping using a uniform random function. The tough text is random since it is picked from a list with no duplicate values.
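The sketch below illustrates how such a tail of tough text could be assembled from a spaminess map: tokens whose spaminess lies between the two thresholds are collected and a random selection of them is returned. The class, method and parameter names are illustrative and the code is not the generateThoughText implementation itself.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Minimal sketch: pick "tough" tokens whose spaminess is far from 0 and 1.
public class ToughTextSketch {

    // Selects tokens whose spaminess lies between the two thresholds and
    // returns a random selection of them, to be appended to a generated
    // message. The spaminess map is assumed to come from a trained filter.
    static String toughTail(Map<String, Double> spaminessMap, int numberOfWords,
                            double low, double high, Random random) {
        List<String> candidates = new ArrayList<>();
        for (Map.Entry<String, Double> entry : spaminessMap.entrySet()) {
            double spaminess = entry.getValue();
            if (spaminess >= low && spaminess <= high) candidates.add(entry.getKey());
        }
        StringBuilder tail = new StringBuilder();
        for (int i = 0; i < numberOfWords && !candidates.isEmpty(); i++) {
            if (i > 0) tail.append(' ');
            tail.append(candidates.get(random.nextInt(candidates.size())));
        }
        return tail.toString();
    }

    public static void main(String[] args) {
        Map<String, Double> spaminess = Map.of(
                "viagra", 0.99, "meeting", 0.05, "offer", 0.55, "report", 0.35);
        // Only "offer" and "report" fall in the 0.2-0.8 band of hard-to-classify tokens.
        System.out.println(toughTail(spaminess, 5, 0.2, 0.8, new Random()));
    }
}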
To investigate the effect of using tough text, a 30 fold cross validation similar to that used for table 6, but now using a training set of 150, was performed. The text was generated using the generateThoughText function with an order-3 text generator and a probability mapping from an order-1 PaulGraham filter. The results showed that a Bayesian spam filter improved its AUC from 0.9781 to 0.9834.
The same experiment repeated with an order-3 spam mapping produced an improvement from AUC=0.9766 to AUC=0.9839. The optimal classifier using a cost ratio of 6.67 changed from (3.0, 12.3) to (2.0, 9.6).
Comparing these numbers to table 6, it seems that the introduction of tough text using an order-3 word level text generator increases the performance of a Bayesian spam filter at least as much as the normal text generation.
The reason for this might be that the order-3 text helps the filter in assigning high spam probability to words that are often seen a short distance from typical spam words. Another explanation might be that the introduction of random words in effect reduces the order-3 text to a mixture of order-3 text and random words. It was shown in table 6 that text of lower order produces better results.
Similar experiments were performed to record the reaction of an order-3 Markovian spam filter when tough text was introduced to the training set. On a 30 fold cross validated run using an order-1 spam mapping the performance increased from AUC=0.9842 to AUC=0.9913. When using an order-3 spam mapping the performance increased from AUC=0.9838 to AUC=0.9899.
It seems that there is little difference in how a Bayesian and a Markovian filter respond to the tough text. Both of them experience an improved AUC, but the improvement is smaller than what was observed in section 5.3.
In an attempt to increase the bias towards fewer false positives, the above experiment was repeated, but this time only generating tough spam-like text. The resulting training corpus contains 1 part ham, 1 part spam and 1 part spam-look-alike. The results show a performance decrease from AUC = 0.975 to AUC = 0.968. The optimal classifier went from (2.9, 13.3) to (4.0, 13.5). Even though the AUC is slightly decreased and the FP of the optimal classifier is increased, the FN count is stable.
Section 4.2 discussed how email data sets differ from many other types of data, especially when the goal is to process the data using a naïve Bayesian classifier. Fu and Silver (2004) showed that there was a statistically significant difference in performance between treating email like unstructured data and maintaining the chronology of the data during experiments. In the cross validated runs, the files in the test set are always of a later date than the training set, but the randomness involved in generating training sets often puts files received over a long time period next to each other. This is not a critical issue when using a fixed training set. It would, however, be an error source when iterative data sets were used. Based on these observations, some experiments have been carried out where the random shuffling of cross validation has been replaced by manual dataset construction with chronological correctness.
Using a training set of 50 spam messages and 50 ham messages and a test set of preceding messages, 200 spam and 200 ham, some experiments where artificial text was added to the training set were performed. The artificial text was generated 30 times and added to the same training set. The AUC before interpolating was 0.9722. The results are shown in table 8.
The results are shown in table 8.
Training set augmented with:
50 ‘ham’ + 50 ‘spam’
50 tough ‘ham’ + 50 tough ‘spam’
50 tough ‘spam’
Interpolated AUC
0.9742
0.9748
0.9735
Trend
+0.002
+0.0026
+0.0013
Table 8, augmented vs. normal training set on Bayesian filter
The findings in table 8 contradict the findings from the cross validated experiments and show an increase in performance when training on tough text.
Improving Spam Filtering by Training on Artificially Generated Text
- 47 -
In some experiments, the introduction of only tough 'spam' yields a noticeable increase in AUC. Figure 20 shows the ROC curves from a run on a Bayesian classifier where 100 tough 'spam' messages are added to a training set of 100 + 100 messages. In figure 20b, the filter makes fewer FP errors than in 20a over most of the graph. From a spam filtering point of view this is a desirable development.
Figure 20a
Figure 20b
Figure 19 demonstrates that the effect of interpolating varies with different sizes of training sets and that the experiments in section 5.3 were most useful for training sets containing 100 messages or more. To check whether this claim would hold on a non-cross validated run, an experiment was carried out on a training set containing only 20 messages, 10 of each class. The resulting performances are shown in figure 21.
Figure 21a
Figure 21b
Figure 21b shows an increase in AUC after interpolating. Both figure 20b and figure 21b share the same trend of decreasing the number of false positive classifications. If these results are correct it seems that the augmentation applied to the training set serves as a way of biasing the classifier to be more
conservative. One explanation for this observation is that the artificially generated spam introduces a bias towards fewer false positives. By adding spam-like messages to the training set, the prior probability is skewed from 0.5 to 0.67. However, a simple bias only skews the classifier from liberal towards conservative, and we would expect that the improvement in one type of error would come at the cost of the other kind of error, as observed in the cross validated experiment. Both figure 20b and figure 21b demonstrate that even though fewer FP errors are made, the filters still have a larger AUC and more true positives than their counterparts in figures 20a and 21a.
If the observations are correct, it seems that the introduction of spam-like text can bias the classifier towards fewer false positives and at the same time maintain an acceptable performance, possibly through a reduction of over fitting as discovered in section 5.3.
Similar results have not been observed in cross validated experiments and the validity of the findings cannot be guaranteed.
In section 5.3 it was discovered that the best results were achieved when training on order-0 text. Based on this observation an experiment was performed where the artificially generated text contained only tough words and no Markov text. The results showed no improvement compared to the original training set.
The second approach to selecting features that are hard to classify is through the measurement of entropy. Two different calculations are used: Mutual Information and Information Gain. Both measure the expected reduction in entropy caused by partitioning the examples according to a given feature. In this dissertation Mutual Information is defined following Cover and Thomas (1991):

MI(X;C) = \sum_{x \in \{0,1\},\; c \in \{spam,\,ham\}} P(X{=}x, C{=}c) \cdot \log \frac{P(X{=}x \mid C{=}c)}{P(X{=}x) \cdot P(C{=}c)}
Information Gain is defined following Yang and Pedersen (1997):

IG(X;C) = \log \frac{P(X \wedge C)}{P(X) \cdot P(C)} \approx \log \frac{|X \wedge C| \cdot |TrainingSet|}{\left(|X \wedge C| + |\hat{X} \wedge C|\right) \cdot \left(|X \wedge C| + |X \wedge \hat{C}|\right)}
The different notations are chosen to avoid mix-ups. In discussions, both MI and IG are referred to as information gain.
To calculate the information gain of features, an instance of the IG spam filter is used. IG has functionality for returning a list of words that satisfy a criterion for entropy with the functions getMutualInformation(numberOfWords, quality) and getInformationGain(numberOfWords, quality). The numberOfWords attribute decides how many words should be included in the calculations. After the calculations, the words are sorted according to their information gain. The quality attribute is used to choose whether to use the highest (0), the intermediate (1) or the lowest (2) 33% of the words calculated.
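For illustration, the sketch below computes a count-based mutual information score for a single token, using the standard definition MI = Σ P(x,c)·log(P(x,c)/(P(x)P(c))), and shows how a ranked word list could be sliced by the quality attribute. The class and method names are illustrative; this is a sketch of the idea, not the IG filter's actual implementation.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Minimal sketch: score a token by mutual information and slice a ranking by "quality".
public class MutualInformationSketch {

    // MI(X;C) from the four document counts of a token, using the standard
    // definition MI = sum over x,c of P(x,c) * log(P(x,c) / (P(x) * P(c))).
    static double mutualInformation(int spamWith, int spamWithout, int hamWith, int hamWithout) {
        double n = spamWith + spamWithout + hamWith + hamWithout;
        double[][] joint = {
                {spamWith / n, hamWith / n},       // X = 1 (token present)
                {spamWithout / n, hamWithout / n}  // X = 0 (token absent)
        };
        double mi = 0.0;
        for (int x = 0; x < 2; x++) {
            for (int c = 0; c < 2; c++) {
                double pxc = joint[x][c];
                if (pxc == 0) continue; // 0 * log 0 is taken as 0
                double px = joint[x][0] + joint[x][1];
                double pc = joint[0][c] + joint[1][c];
                mi += pxc * Math.log(pxc / (px * pc));
            }
        }
        return mi;
    }

    // quality 0, 1 or 2 selects the highest, intermediate or lowest third of the ranking.
    static List<String> selectByQuality(Map<String, Double> miScores, int quality) {
        List<String> ranked = new ArrayList<>(miScores.keySet());
        ranked.sort((a, b) -> Double.compare(miScores.get(b), miScores.get(a)));
        int third = ranked.size() / 3;
        return ranked.subList(quality * third, quality == 2 ? ranked.size() : (quality + 1) * third);
    }

    public static void main(String[] args) {
        // A token seen in 40 of 50 spam and 5 of 50 ham messages is highly informative.
        System.out.println(mutualInformation(40, 10, 5, 45));
        Map<String, Double> scores = Map.of("viagra", 0.35, "offer", 0.10, "the", 0.001);
        System.out.println(selectByQuality(scores, 2)); // the lowest-scoring third
    }
}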
The effect of training a Bayesian spam filter on a training set augmented with artificially generated text as described above was measured on a 50 fold cross validated run, as described by the algorithm in section 5.3, with 100 files in the training set and 200 in the test set. The first half of the text in each message is order-1 word level text.
Text generated by adding words selected by the MI entropy reduction formula, using the first 333 of 1000 words, increased the filter's performance from AUC=0.9715 to AUC=0.9763. The increase is not statistically significant (0.9715 ± 0.0148 vs. 0.9763 ± 0.0147, P=0.108).
By using the last 333 words instead of the first 333, the performance increased from
AUC=0.9731 to AUC=0.9866, which is significant (0.9731 ± 0.0167 vs. 0.9866 ±
0.0103, P=0.002). The optimal classifier changed from (3.2, 14.6) to (2.8, 8.6).
Even though the data is noisy, there seems to be a trend that filters make fewer errors after
interpolation with high-entropy data.
The above experiments were repeated using the IG formula for calculating entropy
reduction. The performance increased from AUC=0.9775 to AUC=0.9790 for the first
333 of 1000 words, and from AUC=0.9693 to AUC=0.9786 for the last 333 words.
Some runs were performed with different settings in the text generation. The results from
these and the above experiments are summarised in table 9. “MI, 1000, 0” means that
the tough text is picked from the first 333 words of a list of 1000 calculated by the MI
function. “MI, 2000, 2” would mean the last 667 words of a list of 2000.
Training set   AUC norm   AUC augm   FP norm   FP augm   TP norm   TP augm
MI, 1000, 0    0.9715     0.9763     3.7       2.7       12.3      13.7
MI, 1000, 2    0.9733     0.9816     3.2       2.8       14.6      8.6
MI, 2000, 2    0.9744     0.9814     3.5       2.6       12.3      9.6
MI, 500, 2     0.9739     0.9864     2.9       2.4       16.4      7.2
MI, 50, 0      0.9696     0.9852     3.6       2.0       13.9      9.1
IG, 1000, 0    0.9775     0.9790     2.7       2.7       14.2      11.7
IG, 1000, 2    0.9693     0.9786     3.4       2.5       15.6      14.0
IG, 1500, 2    0.9768     0.9835     2.9       2.3       11.2      7.6
IG, 2000, 2    0.9759     0.9839     3.2       2.4       12.2      8.1
IG, 2000, 0    0.9768     0.9830     2.8       2.7       12.9      6.5
"Normal"       0.9771     0.9844     3.0       2.5       12.2      8.2
Table 9, augmented vs. normal training set on Bayesian filter
The results in table 9 suggest that the generation of tough text is not a method worth
pursuing. Only on two occasions is the generated text marginally better than a
“normal” order-1 word-level text generator, and even for “MI, 500, 2” the difference is not
statistically significant (0.9844 ± 0.0109 vs. 0.9864 ± 0.0078, P=0.274).
Artificially generated text that has been supplemented with tough words has not been
shown to offer any improvement over the results discovered in section 5.3. Some of the
runs in table 9 have been repeated on an order-3 Markovian spam filter to check for
any differences in results between Bayesian and Markovian filters. The resulting
performances and optimal classifiers are shown in table 10.
Training set   AUC norm   AUC augm   FP norm   FP augm   TP norm   TP augm
"Normal"       0.9792     0.9885     3.3       2.0       14.5      9.0
MI, 500, 2     0.9778     0.9878     3.1       1.9       10.3      8.8
MI, 1000, 2    0.9779     0.9875     2.9       1.3       12.6      10.4
IG, 1000, 2    0.9785     0.9879     2.6       1.8       10.9      7.9
Table 10, augmented vs. normal training set on Markovian filter
Similar to table 9, no result in table 10 seems to offer further improvement over that
offered by the basic text generator labelled “Normal”. Compared to the first
technique demonstrated, where tough words were extracted from a spaminess
mapping, the high-entropy text seems to perform even worse. There was a distinct
difference in the results returned when using traditional cross validation and
manual cross validation. The traditional cross validation is noisier but fairer.
On the other hand, email filtering is an on-line learning task and any cross validation
is an unrealistic attempt to recreate the stream of new messages arriving in an inbox.
There is no clear evidence to suggest that using tougher text is beneficial. There is
some evidence that the introduction of only one class of artificially generated text
is an efficient way of biasing the filter, but probably at the cost of a reduced AUC.
6 Conclusion and further work
The first aim tackled in this report was to introduce ROC curves as a metric well
suited for measuring and visualising spam filter performance. In section 3, ROC
curves were discussed and a set of tools was implemented to perform basic
operations on ROC curves. A class representing a ROC graph was designed, and
functionality for measuring AUC, drawing graphs (using ROCOn) and finding the
optimal classifier was implemented. The tools are designed in a generic way, enabling
them to tackle other binary classification problems. The tool set has proved very
useful throughout the discussions in this dissertation.
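The ROC class itself is among the classes not listed in Appendix 1. Purely as a hedged illustration of the AUC calculation (the class and method names below are invented for the example, not the dissertation's tools), the area under a set of (FP rate, TP rate) points can be computed with the trapezoidal rule:

import java.util.Arrays;

public class AucSketch {
    /** points[i] = {fpRate, tpRate}, both in [0,1]; assumed to include (0,0) and (1,1). */
    static double auc(double[][] points) {
        double[][] sorted = points.clone();
        Arrays.sort(sorted, (a, b) -> Double.compare(a[0], b[0])); // order by FP rate
        double area = 0;
        for (int i = 1; i < sorted.length; i++) {
            double width = sorted[i][0] - sorted[i - 1][0];
            double meanHeight = (sorted[i][1] + sorted[i - 1][1]) / 2;
            area += width * meanHeight; // trapezoid between consecutive ROC points
        }
        return area;
    }

    public static void main(String[] args) {
        double[][] roc = { {0, 0}, {0.05, 0.80}, {0.20, 0.95}, {1, 1} };
        System.out.println("AUC = " + auc(roc)); // prints roughly 0.93 for this toy curve
    }
}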
Several extensions are possible to the ROC tools:
• Visualising ROC curves with zoom. ROCOn shows the entire graph, but in
spam filtering, the AUC is typically over 99% and it becomes impossible to
identify characteristics of curves on a 100% scale.
• Averaging of ROC curves as suggested by Fawcett (2003b).
• Analysis of the convex hull.
No comparative study has been carried out to compare the different metrics described
in section 3.2 with ROC curves. It is possible that other metrics may be more suitable
for measuring properties of spam filters in some of the experiments in this dissertation.
However, throughout this report ROC curves have been used since their advocacy has
been one of the objectives stated for this dissertation.
In section 4, practical spam filtering was discussed and a comparison was made
between Bayesian and Markovian spam filters. The motivation for this comparison,
which was also stated as a secondary aim, was to decide which spam filtering
algorithm was the most accurate when compared under equal conditions. The
Markovian filter is really only an extension of the Bayesian filter, and the two were
implemented in the same program to enable a fair comparison. Even though it is
possible that a Bayesian filter might have some properties making it preferable to a
Markovian filter, the experiments carried out in this dissertation showed that the
Markovian filter is significantly better than the Bayesian filter.
The implementations used for the comparisons are very basic spam filters; they use a
simple tokeniser and implement only the simplest of algorithms. Other available
implementations are more specialised, using advanced tokenising, feature generation,
feature elimination, more complex ways of calculating combined probabilities and
more elaborate ways of weighting features. These features are deliberately kept out
since the comparison should be fair, and because the extent of this project did not
allow for the kind of labour required to tweak the performance of the filters.
The Markovian filter's better performance is credited to the fact that it generates
more features from the text than a Bayesian filter, and to its ability to solve problems
that are not linearly separable.
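To illustrate, using the chain construction documented in Appendix 1 (the example tokens are arbitrary): with a chain length of 3, the token window {win, free, cash} produces the features win, win|free, win|cash and win|free|cash, i.e. 2^(3-1) = 4 features per window position, whereas the Bayesian filter sees only the single token. Combined features such as win|cash can separate classes that the individual tokens cannot.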
The primary aim of this dissertation, described in section 5.3, has been to measure the
effect of training spam filters on artificially generated text. From a theoretical point of
view, it was not expected that spam filters should benefit from the artificially
generated text since the transitional probabilities of words, and therefore the relative
frequencies, should not be altered. However, the findings presented in figure 19 show
that the method can improve the performance of a spam filter by a statistically
significant margin on training sets containing 100 or more messages. The discussion
in section 5.3 arrived at the conclusion that the larger amount of training data must
interpolate the output function arrived at during training. This way, overfitting is
reduced and the error between the target function and the output function is decreased.
At the same time, the larger training set seems to smooth the more extreme
probabilities, allowing a smoother transition from spaminess to haminess. This effect
means fewer false positive errors are made when legitimate email contains features
often seen in spam. The effect seems to be greatest when a limited amount of training
data is available.
The last of the aims set for this dissertation was to measure the impact of training a
spam filter on text that had been made ‘tougher’ by using inside information from a
spam filter. By extending the experiments in section 5.3, groups of tough words were
selected and added to the generated text. Neither the tough words gathered from a
spaminess mapping nor those gathered from an entropy analysis offered any further
improvement over the basic text generation devised in section 5.3.
If the findings in this dissertation are correct and can be validated, the method might
be a very useful technique for boosting the performance of Bayesian spam filters
when only a limited amount of training data is available. From the findings it can also
be suggested that much effort should be put into the development of probability mappings
that are less affected by overfitting in the early stages of filtering.
References
Androutsopoulos, I., Koutsias, J., Chandrinos, K., Paliouras, G. and Spyropoulos, C.
D. (2000a) An evaluation of naïve Bayesian anti-spam filtering. In
Proceedings of the Workshop on Machine Learning in the New Information
Age (Eds, Potamias, G., Moustakis, V. and van Someren, M.), pp. 9-17.
Androutsopoulos, I., Paliouras, G., Karkaletsis, V., Sakkis, G., Spyropoulos, C. D.
and Stamatopoulos, P. (2000b) Learning to Filter Spam E-Mail: A
Comparison of a Naïve Bayesian and a Memory-Based Approach. In
Proceedings of the 4th European Conference on Principles and Practice of
Knowledge Discovery in Databases (PKDD-2000) (Eds, Zaragoza, H.,
Gallinari, P. and Rajman, M.), Lyon, France, pp. 1-13.
Androutsopoulos, I., Paliouras, G. and Michelakis, E. (2004) Learning to Filter
Unsolicited Commercial E-Mail. Technical Report, NCSR Demokritos.
Bentley, J. (2000) Programming Pearls, Addison Wesley Professional.
Cohen, W. W. (1996) Learning rules that classify e-mail. In Proceedings of the 1996
AAAI Spring Symposium on Machine Learning in Information Access.
Cover, T. M. and Thomas, J. A. (1991) Elements of Information Theory, New York:
John Wiley & Sons, Inc.
Domingos, P. and Pazzani, M. (1997) On the optimality of the simple Bayesian
classifier under zero-one loss. Machine Learning, 29, 103-130.
Drummond, C. and Holte, R. C. (2004) What ROC Curves Can't Do (and Cost Curves
Can). In Proceedings of ROC Analysis in Artificial Intelligence (ROCAI-2004), Valencia.
Fawcett, T. (2003a) "In-vivo" spam filtering: A challenge problem for data mining.
Hewlett Packard Laboratories.
Fawcett, T. (2003b) ROC Graphs: Notes and Practical Considerations for Data
Mining Researchers. Hewlett Packard Laboratories.
Fisher, M. J., Fieldsend, J. E. and Everson, R. M. (2004) Precision and Recall
Optimisation for Information Access Tasks. In Proceedings of ROC Analysis
in Artificial Intelligence (ROCAI-2004), Valencia.
Flach, P. A. (2004) The many faces of ROC analysis in machine learning. In
Proceedings of the 21st International Conference on Machine Learning (ICML'04).
Flach, P. A. and Wu, S. (2003) Repairing concavities in ROC curves. In Proceedings
of the 2003 UK Workshop on Computational Intelligence, University of Bristol,
pp. 38-44.
Forman, G. H. (2004) Feature selection for two-class classification systems. United
States Patent Application: 20040059697.
Fu, C.-L. and Silver, D. L. (2004) Time-Sensitive Sampling for Spam Filtering. In
Proceedings of Canadian Conference on AI Acadia University, pp. 551-553.
Furnkranz, J. and Flach, P. A. (2003) An analysis of rule evaluation metrics. In
Proceedings of the 20th International Conference on Machine Learning
(ICML'03), AAAI Press, pp. 202-209.
Graham, P. (2002) A Plan for Spam. Available:
http://www.paulgraham.com/spam.html.
Graham, P. (2003) Better Spam Filtering. Available:
http://www.paulgraham.com/better.html.
Heise (2004) Spam-Welle überrollt die TU Braunschweig. Available:
http://heise.de/newsticker/meldung/print/47575. Heise Zeitschriften Verlag.
Hillis, W. D. (1990). Co-Evolving Parasites Improve Simulated Evolution as an
Optimization Procedure. Physica D, 42, 228-234.
Holden, S. (2002a) Spam Filtering. Available: http://sam.holden.id.au/writings/spam/.
Holden, S. (2002b) Spam Filtering II. Available:
http://sam.holden.id.au/writings/spam2/.
Katirai, H. (1999). Filtering Junk E-Mail: A Performance Comparison between
Genetic Programming & Naïve Bayes. MSc thesis, University Of Waterloo
Kernighan, B. W. and Pike, R. (1999) The Practice of Programming.
Lachiche, N. and Flach, P. A. (2003) Improving accuracy and cost of two-class and
multi-class probabilistic classifiers using ROC curves. In Proceedings of the
Twentieth International Conference on Machine Learning (ICML-2003), Washington DC.
Lewis, D. (1991) Evaluating Text Categorization. In Proceedings of the Speech and
Natural Language Workshop, Morgan Kaufmann, pp. 312-318.
Louis, G. (2003a) Bogofilter Calculations: Comparing Bayes Chain Rule with Fisher's
Method for Combining Probabilities. Available:
http://www.bgl.no/bogofilter/BcrFisher.html.
Louis, G. (2003b) Paul Graham's Refinements to Bayesian Filtering. Available:
http://www.bgl.nu.bogofilter/graham.html
Michelakis, E., Androutsopoulos, I., Paliouras, G., Sakkis, G. and Stamatopoulos, P.
(2004) Filtron: A Learning-Based Anti-Spam Filter. In Proceedings of 2004
Spam Conference.
Minsky, M. and Papert, S. (1969) Perceptrons, Cambridge, MA: MIT Press.
Mitchell, T. M. (1997) Machine Learning, Singapore: McGraw-Hill.
Provost, F., Fawcett, T. and Kohavi, R. (1998) The Case Against Accuracy Estimation
for Comparing Induction Algorithms. In Proceedings of the Fifteenth
International Conference on Machine Learning (ICML-98) (Ed, Shavlik, J. W.),
Morgan Kaufmann, Madison, WI.
Rennie, J. D. M., Shih, L., Teevan, J. and Karger, D. R. (2001) Tackling the Poor
Assumptions of Naïve Bayesian Text Classifiers.
Sahami, M., Dumais, S., Heckerman, D. and Horvitz, E. (1998) A Bayesian Approach
to Filtering Junk E-Mail. In Proceedings of Learning for Text Categorization.
AAAI Press, pp. 55-62.
Sakkis, G., Androutsopoulos, I., Paliouras, G., Karkaletsis, V., Spyropoulos, C. D.
and Stamatopoulos, P. (2001) Stacking classifiers for anti-spam filtering of
e-mail. In Proceedings of the Conference on Empirical Methods in Natural
Language Processing (Eds, Lee, L. and Harman, D.), Carnegie Mellon
University, Pittsburgh, PA, USA, pp. 44-50.
Shannon, C. E. (1948) A Mathematical Theory of Communication. The Bell System
Technical Journal, 27, 379-423, 623-656.
Sophos (2003) Spam: A many rendered thing. Sophos White Paper.
SpamAssassin (2004) SpamAssassin: Tests Performed. Available:
http://spamassassin.apache.org.tests.html.
Spamhaus (2004) The Definition of Spam. Available:
http://www.spamhaos.org/definition.html
Spira, J. B. (2003) Spam E-mail and Its Impact on IT Spending and Productivity.
Basex.
Sutton, R. S. and Barto, A. G. (1998) Reinforcement Learning: An Introduction,
Cambridge, Massachusetts : The MIT Press.
Symantec (2004) Spam Statistics. Available: http://brightmail.com/spamstats.html .
Yang, Y. and Pedersen, J. O. (1997) A Comparative Study on Feature Selection in
Text Categorization. In Proceedings of the 14th International Conference on
Machine Learning (ICML-97).
Yerazunis, W. S. (2004) The Spam-Filtering Accuracy Plateau at 99.9% Accuracy
and How to Get Past It. In Proceedings of the 2004 MIT Spam Conference.
MIT, Cambridge, Massachusetts.
Appendix 1: Source code
Due to restrictions on space, only 2 of 16 classes are appended: The implementation
of a Markovian spam filter and the word-level text generator.
MarkovianSpamFilter.java
/**
* @author Torgeir-ts3904 current version 14.07.2004
*/
package org.tsorvik.spam.training;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.StringTokenizer;
import java.util.TreeMap;
/**
 * Spam-filter class which implements a naive Bayesian classifier enhanced to classify on token-chains
 * (nGrams), following the model of Yerazunis (2004), Peng et al. (2004) and others. A Markovian spam
 * filter using token-chains of length 1 is a normal naive Bayes filter.
 */
public class MarkovianSpamFilter {

    private final String stopCharacters = " \t\n\r\f;(),.@";
    private final int lengthOfMarkovChain; // 1,2,3,4,5
    private final int numberOfChains;      // 1,2,4,8,16
    private final int TP = 0, FN = 1, FP = 2, TN = 3;
    private final double hamIncrement;
    private int spamMessageCounter = 0;
    private int hamMessageCounter = 0;
    private int hamTokens = 0;
    private int spamTokens = 0;
    private int[] confusionMatrix = new int[4];
    private TreeMap vocabulary;
    private int[] weights;

    /**
     * Default constructor. Instantiates a naive Bayes classifier.
     */
    public MarkovianSpamFilter() {
        vocabulary = new TreeMap();
        lengthOfMarkovChain = 1;
        numberOfChains = 1;
        hamIncrement = 1;
        weights = new int[1]; // must exist for the single-chain case used by createChain()
    }

    /**
     * Constructor. Initialises a Markovian spam filter, given the chain length and the step size of the
     * incremental occurrence counter for ham (bias). The attribute numberOfChains is derived from
     * chainLength for simplicity.
     */
    public MarkovianSpamFilter(int chainLength, double hamIncrement) {
        vocabulary = new TreeMap();
        lengthOfMarkovChain = chainLength;
        numberOfChains = (int) Math.pow(2, chainLength - 1);
        weights = new int[numberOfChains];
        this.hamIncrement = hamIncrement;
    }

    /**
     * String[] createChain(), generates a set of token-chains from a set of tokens, <br>
     * e.g. <b>{A,B,C} -> {[A],[A,B],[A,C],[A,B,C]}</b>, |n| -> 2^(|atoms|-1)
     */
    public String[] createChain(String[] atoms) {
        if (lengthOfMarkovChain == 1) {
            weights[0] = 1;
            return atoms;
        }
        String[] chain = new String[numberOfChains];
        for (int subChain = 0; subChain < numberOfChains; subChain++) {
            chain[subChain] = atoms[0];
            weights[subChain] = 1;
        }
        int length = 1;
        for (int conc = 1; conc < lengthOfMarkovChain; conc++) {
            length = (int) Math.pow(2, conc - 1);
            boolean addConc = false;
            for (int subChain = 0; subChain < numberOfChains; subChain++) {
                if (addConc == true) {
                    chain[subChain] = ((new StringBuffer(chain[subChain]))
                            .append("|").append(atoms[conc])).toString();
                    weights[subChain]++;
                }
                // toggle addConc every 'length' chains so each atom is combined with every subset
                if (((subChain + 1) % (length)) == 0) {
                    if (addConc == true)
                        addConc = false;
                    else
                        addConc = true;
                }
            }
        }
        return chain;
    }
    /**
     * My interpretation of Markovian learning. Reads a message and counts frequencies of token
     * chains given the class.
     */
    public void learnMarkovText(File message) {
        String line = "";
        ArrayList messageContents = new ArrayList(2000);
        // Find classification of message to parse
        int classification = 1;
        if (message.getName().charAt(0) == 'S') {
            classification = 0;
            spamMessageCounter++;
        } else
            hamMessageCounter++;
        // Read contents of message into an ArrayList to allow seamless traversal through the text
        try {
            BufferedReader reader = new BufferedReader(new FileReader(message));
            while ((line = reader.readLine()) != null) {
                StringTokenizer tokenizer = new StringTokenizer(line, stopCharacters);
                while (tokenizer.hasMoreTokens()) {
                    messageContents.add(tokenizer.nextToken());
                    if (classification == 0)
                        spamTokens++;
                    else
                        hamTokens++;
                }
            }
            reader.close();
        } catch (IOException IOError) {
            System.err.println("LEARN_NAIVE_BAYES_TEXT() -> exception while reading file:\n\t" + IOError);
        }
        // Fill the first n places of the sliding token window
        String[] tokenWindow = new String[lengthOfMarkovChain];
        for (int chain = 0; chain < lengthOfMarkovChain; chain++)
            tokenWindow[chain] = messageContents.get(chain).toString();
        // For every possible position of the token window
        for (int atomicToken = lengthOfMarkovChain; atomicToken < messageContents.size(); atomicToken++) {
            String[] currentChains = createChain(tokenWindow);
            // For all chains made from atoms[], update the occurrence counts of the observed class
            for (int thisChain = 0; thisChain < currentChains.length; thisChain++) {
                String currentChain = currentChains[thisChain];
                if (vocabulary.containsKey(currentChain)) {
                    if (classification == 0)
                        ((TokenValue) vocabulary.get(currentChain)).addSpam();
                    else
                        ((TokenValue) vocabulary.get(currentChain)).addHam();
                } else {
                    vocabulary.put(currentChain, new TokenValue(classification,
                            hamIncrement, weights[thisChain]));
                }
            }
            // Slide the window one position
            for (int chain = 0; chain <= lengthOfMarkovChain - 2; chain++)
                tokenWindow[chain] = tokenWindow[chain + 1];
            tokenWindow[lengthOfMarkovChain - 1] = messageContents.get(atomicToken).toString();
        }
        messageContents = null;
    } // LEARN_MARKOV_TEXT

    /**
     * Tokenises an input message and calculates spaminess based on the learnt mapping.
     */
    public double classifyMarkovText(File message, double decisionThreshold) {
        String line = "";
        ArrayList messageContents = new ArrayList(4000);
        try {
            BufferedReader reader = new BufferedReader(new FileReader(message));
            while ((line = reader.readLine()) != null) {
                StringTokenizer tokenizer = new StringTokenizer(line, stopCharacters);
                while (tokenizer.hasMoreTokens())
                    messageContents.add(tokenizer.nextToken());
            }
            reader.close();
        } catch (IOException IOError) {
            System.err.println("LEARN_NAIVE_BAYES_TEXT() -> exception while reading file:\n\t" + IOError);
        }
        double spaminess = 0;
        // Class priors and smoothed token totals
        double PSpam = Math.log((spamMessageCounter + 0.5)
                / (hamMessageCounter + spamMessageCounter + 0.5));
        double PHam = Math.log((hamMessageCounter + 0.5)
                / (hamMessageCounter + spamMessageCounter + 0.5));
        double ammountSpam = spamTokens + vocabulary.size();
        double ammountHam = hamTokens + vocabulary.size();
        // Fill the first n places of the sliding token window
        String[] tokenWindow = new String[lengthOfMarkovChain];
        for (int chain = 0; chain < lengthOfMarkovChain; chain++)
            tokenWindow[chain] = messageContents.get(chain).toString();
        // For every possible position of the token window
        for (int atomicToken = lengthOfMarkovChain; atomicToken < messageContents.size(); atomicToken++) {
            String[] currentChains = createChain(tokenWindow);
            // For all chains made from atoms[], add a probability to the accumulated log-probabilities
            for (int thisChain = 0; thisChain < currentChains.length; thisChain++) {
                String currentChain = currentChains[thisChain];
                double pWordSpam = 0;
                double pWordHam = 0;
                if (vocabulary.containsKey(currentChain)) {
                    TokenValue current = ((TokenValue) vocabulary.get(currentChain));
                    double weight = 0.5 + (weights[thisChain] / ((double) numberOfChains));
                    pWordSpam = (current.getSpamCount() + 0.5) / ammountSpam;
                    pWordHam = (current.getHamCount() + 0.5) / ammountHam;
                    PSpam += Math.log(pWordSpam * weight);
                    PHam += Math.log(pWordHam * weight);
                }
            }
            // Slide the window one position
            for (int chain = 0; chain <= lengthOfMarkovChain - 2; chain++)
                tokenWindow[chain] = tokenWindow[chain + 1];
            tokenWindow[lengthOfMarkovChain - 1] = messageContents.get(atomicToken).toString();
        }
        messageContents = null;
        spaminess = PHam / PSpam;
        char trueClass = message.getName().charAt(0);
        if (trueClass == 'S') {
            if (spaminess > decisionThreshold)
                confusionMatrix[TP]++;
            else {
                confusionMatrix[FN]++;
            }
        } else {
            if (spaminess > decisionThreshold) {
                confusionMatrix[FP]++;
            } else
                confusionMatrix[TN]++;
        }
        return spaminess;
    } // CLASSIFY_MARKOV_TEXT
/**
* Use misclassifications as source of correction
*/
public double classifyAndLearnOnErrors(File message, double decisionThreshold) {
double sp = classifyMarkovText(message, decisionThreshold);
char trueClass = message.getName().charAt(0);
if (trueClass == 'S' && sp < decisionThreshold)
learnMarkovText(message);
if (trueClass != 'S' && sp > decisionThreshold)
learnMarkovText(message);
return sp;
}
/**
* Displays a confusion matrix.
*/
public final void print() {
System.out.println("Size: " + vocabulary.size());
System.out.println("\n----------------- ");
System.out.println("| " + confusionMatrix[TP] + "\t| "
+ confusionMatrix[FN] + "\t|");
System.out.println("----------------- ");
System.out.println("| " + confusionMatrix[FP] + "\t| "
+ confusionMatrix[TN] + "\t|");
System.out.println("-----------------");
}
/**
* Gives the mapping
*/
public TreeMap getMapping(){
return vocabulary;
}
} // CLASS
WordLevel_MailGenerator.java
/**
* Created on 23-Jul-2004
* By @author ts3904
*/
package org.tsorvik.spam.training;
import java.util.TreeMap;
import java.util.ArrayList;
import java.util.StringTokenizer;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
/**
* Builds a Markov model from training data and generates new text based on
* transitional probabilities on word-level.
*/
public class WordLevel_MailGenerator {
private final int neutralWordsLength = 1000;
private final int markovLength;
private final char classification;
private int outputFileCounter = 1;
private TreeMap mapping = new TreeMap();
private ArrayList trainingText;
private ArrayList firstPrefix;
private TreeMap vocabulary = null;
private ArrayList neutralWords = null;
/**
* Constructor
*/
public WordLevel_MailGenerator(int length, char classe) {
markovLength = length;
classification = classe;
trainingText = new ArrayList(2000 * length);
firstPrefix = new ArrayList(20);
}
/**
* Add to the generator a mapping extracted from training set
*/
public void setMap(TreeMap mapping) {
vocabulary = mapping;
neutralWords = new ArrayList(neutralWordsLength);
Object[] keys = vocabulary.keySet().toArray();
int iterator = 0;
int toughCounter = 0;
while (iterator < keys.length && toughCounter < neutralWordsLength) {
String thisToken = keys[iterator].toString();
TokenValue tv = (TokenValue) vocabulary.get(thisToken);
double tokenProb = tv.getProbability();
if (tokenProb > 0.2 && tokenProb < 0.8 && tv.getCount() > 2) {
neutralWords.add(thisToken);
toughCounter++;
}
iterator++;
}
}
    /**
     * Analyse the training set and extract transitional probabilities
     */
    public void learnMarkovModel(String path) throws IOException {
        File trainingDir = new File(path);
        File[] trainingSet = trainingDir.listFiles();
        String[] window = new String[markovLength];
        // Read all training data into memory
        try {
            for (int trainingFile = 0; trainingFile < trainingSet.length; trainingFile++) {
                if (trainingSet[trainingFile].getName().charAt(0) == classification) {
                    BufferedReader reader = new BufferedReader(
                            new FileReader(trainingSet[trainingFile]));
                    String line = "";
                    boolean startOfMessage = true;
                    while ((line = reader.readLine()) != null) {
                        if (startOfMessage == true) {
                            // Remember the first prefix of each message as a possible starting point
                            startOfMessage = false;
                            StringTokenizer readPrefix = new StringTokenizer(line);
                            ArrayList start = new ArrayList();
                            for (int l = 0; l < markovLength; l++) {
                                start.add(readPrefix.nextToken());
                            }
                            firstPrefix.add(start);
                        }
                        StringTokenizer tokenizer = new StringTokenizer(line);
                        while (tokenizer.hasMoreTokens()) {
                            trainingText.add(tokenizer.nextToken());
                        }
                    }
                    reader.close();
                }
            }
        } catch (java.util.NoSuchElementException e) {
            System.err.println("An exception occured during tokenising:\n\t" + e);
            System.err.println("Remove files generated in previous runs");
            System.exit(0);
        }
        // Fill the sliding window with the first markovLength tokens
        try {
            for (int w = 0; w < markovLength; w++) {
                window[w] = trainingText.get(w).toString();
            }
        } catch (IndexOutOfBoundsException e) {
            System.err.println(e.toString() + "\nNo such path, or empty dir: " + path);
        }
        String prefix = "";
        String suffix = "";
        for (int w = markovLength; w < trainingText.size(); w++) {
            prefix = "";
            for (int j = 0; j < markovLength; j++)
                prefix = prefix.concat(window[j]);
            suffix = trainingText.get(w).toString();
            // Check if this prefix is in mapping by trying to add a suffix.
            // If it is not mapped (if it fails), add an entry and a suffix
            try {
                ((ArrayList) mapping.get(prefix)).add(suffix);
            } catch (NullPointerException e) {
                mapping.put(prefix, new ArrayList());
                ((ArrayList) mapping.get(prefix)).add(suffix);
            }
            // Update the "sliding window"
            for (int s = 0; s < (markovLength - 1); s++)
                window[s] = window[s + 1];
            window[markovLength - 1] = trainingText.get(w).toString();
        }
    }

    /**
     * Generates a new file by semi-randomly picking the next word in the sentence
     */
    public String generateText(String path) throws IOException {
        // Create a starting point for text generation
        ArrayList start = (ArrayList) firstPrefix.get((int) (Math.random() * firstPrefix.size()));
        String[] window = new String[markovLength];
        String prefix = "";
        for (int j = 0; j < markovLength; j++) {
            window[j] = start.get(j).toString();
            prefix = prefix.concat(window[j]);
        }
        String suffix = "";
        // Create output file
        String filePath = path + classification + "ART" + outputFileCounter + ".txt";
        PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(filePath)));
        for (int p = 0; p < markovLength; p++) {
            out.println(window[p]);
        }
        boolean endMessage = false;
        while (endMessage == false) {
            ArrayList posSuff = (ArrayList) mapping.get(prefix);
            try {
                suffix = posSuff.get((int) (Math.random() * posSuff.size())).toString();
                if (suffix.equals("From") || suffix.equals("\n"))
                    endMessage = true;
                else
                    out.println(suffix);
            } catch (NullPointerException e) {
                endMessage = true;
            }
            // Slide the window and rebuild the prefix
            for (int s = 0; s < (markovLength - 1); s++) {
                window[s] = window[s + 1];
            }
            window[markovLength - 1] = suffix;
            prefix = "";
            for (int j = 0; j < markovLength; j++) {
                prefix = prefix.concat(window[j]);
            }
        }
        out.close();
        outputFileCounter++;
        return filePath;
    }
    /**
     * Generates a new file by semi-randomly picking the next word in the sentence.
     * Adds tough words to the file; tokens not considered as clear signs of spam nor ham
     */
    public String generateToughText(String path) throws IOException {
        // Create a starting point for text generation
        ArrayList start = (ArrayList) firstPrefix.get((int) (Math.random() * firstPrefix.size()));
        String[] window = new String[markovLength];
        String prefix = "";
        for (int j = 0; j < markovLength; j++) {
            window[j] = start.get(j).toString();
            prefix = prefix.concat(window[j]);
        }
        String suffix = "";
        // Create output file
        String filePath = path + classification + "ART" + outputFileCounter + ".txt";
        PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(filePath)));
        for (int p = 0; p < markovLength; p++) {
            out.println(window[p]);
        }
        boolean endMessage = false;
        int messageLength = 0;
        while (endMessage == false) {
            messageLength++;
            ArrayList posSuff = (ArrayList) mapping.get(prefix);
            try {
                suffix = posSuff.get((int) (Math.random() * posSuff.size())).toString();
                if (suffix.equals("From") || suffix.equals("\n") || suffix.equals("From "))
                    endMessage = true;
                else
                    out.println(suffix);
            } catch (NullPointerException e) {
                endMessage = true;
            }
            // Slide the window and rebuild the prefix
            for (int s = 0; s < (markovLength - 1); s++) {
                window[s] = window[s + 1];
            }
            window[markovLength - 1] = suffix;
            prefix = "";
            for (int j = 0; j < markovLength; j++) {
                prefix = prefix.concat(window[j]);
            }
        }
        // Add 'tough' words drawn from the neutral-word list built in setMap()
        try {
            for (int addInnocent = 0; addInnocent < (messageLength); addInnocent++)
                out.println(neutralWords.get((int) (Math.random() * neutralWords.size())));
        } catch (NullPointerException e) {
            System.err.println("List of neutral words have not been generated, skipping..");
        }
        out.close();
        outputFileCounter++;
        return filePath;
    }

    /**
     * Generates a new file by semi-randomly picking the next word in the sentence.
     * Adds tough words to the file; tokens with a given entropy
     */
    public String generateToughText(String path, String[] tw) throws IOException {
        // Create a starting point for text generation
        ArrayList start = (ArrayList) firstPrefix.get((int) (Math.random() * firstPrefix.size()));
        String[] window = new String[markovLength];
        String prefix = "";
        for (int j = 0; j < markovLength; j++) {
            window[j] = start.get(j).toString();
            prefix = prefix.concat(window[j]);
        }
        String suffix = "";
        // Create output file
        String filePath = path + classification + "ART" + outputFileCounter + ".txt";
        PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(filePath)));
        for (int p = 0; p < markovLength; p++) {
            out.println(window[p]);
        }
        boolean endMessage = false;
        int messageLength = 0;
        while (endMessage == false) {
            messageLength++;
            ArrayList posSuff = (ArrayList) mapping.get(prefix);
            try {
                suffix = posSuff.get((int) (Math.random() * posSuff.size())).toString();
                if (suffix.equals("From") || suffix.equals("\n") || suffix.equals("From "))
                    endMessage = true;
                else
                    out.println(suffix);
            } catch (NullPointerException e) {
                endMessage = true;
            }
            // Slide the window and rebuild the prefix
            for (int s = 0; s < (markovLength - 1); s++) {
                window[s] = window[s + 1];
            }
            window[markovLength - 1] = suffix;
            prefix = "";
            for (int j = 0; j < markovLength; j++) {
                prefix = prefix.concat(window[j]);
            }
        }
        // Add 'tough' words from the supplied high-entropy word list
        try {
            for (int addInnocent = 0; addInnocent < (messageLength); addInnocent++)
                out.println(tw[((int) (Math.random() * tw.length))]);
        } catch (NullPointerException e) {
            System.err.println("List of tough words has not been generated, skipping..");
        }
        out.close();
        outputFileCounter++;
        return filePath;
    }
} // CLASS
Appendix 2: Poster
Improving Spam Filtering
by training on
artificially generated text
A MSc dissertation by Torgeir Sorvik, supervised by Tim Kovacs
University of Bristol, Department of Computer Science
Motivation and objective
In spam filtering, the amount of training data available will affect the accuracy of
predictions: when starting a spam filter it will not perform satisfactorily until the
user has received a sufficient amount of e-mail. By adding spam-like data to the
training set, the period of time it takes to achieve good performance is decreased.
The effect of training
[Chart: AUC plotted against the size of the training set, from 10 to 1000 messages.]
Interpolating
To simulate incoming email, text is generated based on a model built from the actual
e-mail received. I call this process of increasing the amount of data interpolating.
Artificial text is generated based on transitional probabilities observed in the
training set through a Markov chain analysis.
[Diagram: training set -> tokeniser -> statistical model of tokens (probability, entropy, transitions) -> generator -> simulated mail]
Training
[Diagram: training set -> tokeniser -> put() -> spaminess mapping]
Classifying
The new text is added to the training set.
[Diagram: incoming message -> tokeniser -> get() from mapping -> probability calculations -> classification]
Results
Experiments show that Bayesian spam filters benefit from having the training set
augmented with artificially generated spam. In the early stages of learning the filter
makes fewer FP errors and becomes useful at an earlier stage.
Conclusion
Training spam filters on artificially generated text, especially imitations of spam,
improves spam filtering in the early stages by reducing the number of FP errors.
The method has not proved successful for later stages of filtering, where the size of
the training set is greater than 400 messages. The findings in this dissertation
suggest that spam interpolating could be applied with success to personal Bayesian
spam filters.