Anomaly Detection in Email Traffic

Tomáš Gogár
Dept. of Cybernetics, Czech Technical University, Karlovo namesti 13, 121 35 Prague, Czech Republic
[email protected]
Abstract. Existing email antispam techniques are able to filter out the majority of unsolicited bulk mail; however, in recent years a new type of email has started to appear in user inboxes - newsletters and notifications. Such emails are often referred to as graymail, and they are usually sent to users who have subscribed to them. Typical graymail senders (such as well-established e-commerce services) do not send unsolicited graymail and give recipients the ability to unsubscribe. Last year, the largest freemail service provider in the Czech Republic (Seznam.cz) discovered a new spamming behavior. Spammers use stolen databases of email addresses and send unsolicited emails which resemble usual newsletters. Standard antispam techniques fail to filter out such messages, because the spammers mask themselves by frequently changing addresses, and the similarity to standard newsletters makes content filtering impossible. In this work, we try to detect the masking behavior of spammers by proposing new email descriptors and using anomaly detection algorithms. Preliminary results from testing on a small set of labeled emails suggest that the majority of anomalous emails represent unsolicited bulk mail and that such an approach should help in identifying a significant portion of spam senders.
Keywords
Email, spam, graymail, newsletters, anomaly detection.
1. Introduction
Since the email system provides a very cheap and easy way of communication, it is often used for sending messages to a large audience. A message which is sent to a large number of recipients with no or only a small number of changes in its body is called a bulk email.
1.1. Unsolicited bulk emails
When a bulk message was not requested by the recipient, we refer to it as Unsolicited Bulk Email (UBE) or email spam¹. The huge amount of unsolicited bulk email causes significant costs on the recipient's side (costs for infrastructure and storage) and reduces the usability of the email system itself. Therefore, the effort to automatically filter UBEs is one of the main topics within the email developer community.
¹ Conversely, solicited emails are often referred to as ham.
1.2. Graymails
In addition to spam, other types of bulk emails are sent. The term graymail is used for emails which users have subscribed to. The problem is that spam filtering systems cannot determine whether a message is being delivered only to subscribed users, since no one has an exact history of the user's behavior. Graymails are therefore usually delivered, and only if a large number of users report them as unsolicited can the sender be retrospectively marked as a spammer.
1.3. Unsolicited newsletters
Unsolicited newsletters represent a new spamming trend which has been observed by Seznam.cz (the largest email service provider in the Czech Republic). The provider observed that millions of unwanted emails whose content resembles usual newsletters are sent to the email addresses of their users. Further analysis showed that several spammers misuse large databases of email addresses to promote various e-commerce services. The spammers send emails which appear to be standard newsletters of the promoted company, but from the reactions of users it is evident that they did not subscribe to receive them. Since such messages are just another form of spam, we would like to filter them out. However, this is not a trivial task, mainly for the following reasons:
• The bodies of the emails resemble legitimate newsletters, so it is difficult to filter them out with content filters.
• Spammers have many potential customers, so we do not know which company they will promote next.
• Unlike usual newsletters, these emails are not sent from the companies' mail servers, so it is not possible to create consistent reputation statistics for the promoted companies.
• Spammers try to hide themselves by frequently changing their IP addresses, domains, etc.
1.4. Our goal
During the last year, Seznam.cz identified the two biggest senders of illegal newsletters and developed a system which is able to block these spammers. Unfortunately, in the training phase this approach requires a set of labeled emails from the particular spammer. Such identification is not trivial and requires the work of a human operator, who manually analyzes received emails. The goal of this work is to propose a system which will help the human operator distinguish unsolicited newsletters from other graymail. Our effort to find a semi-automated solution for this task is important because we expect there are more similar senders which Seznam.cz has not identified yet, and a manual search is highly inefficient.
2. Email characteristics
Before we dive into spam filtering techniques, we summarize what sort of information is available when an email arrives at a server which is supposed to decide whether it is ham or spam. We divide the data from an email into three components based on their source - SMTP envelope, email headers and email content. In the following paragraphs we briefly describe these data (an example of the email structure is shown in Figure 1).
2.1. SMTP envelope
The process in which two mail servers communicate in order to exchange an email is called an SMTP session. At the beginning of the session the sending server (often referred to as the client in this context) provides the following information:
• HELO (or EHLO) string A string that identifies the sending server - usually by its fully qualified domain name (FQDN). The EHLO string is used within the newer ESMTP protocol.
• MAIL FROM This field presents the originator's email address. It is also the address where notifications (bounces) of undelivered messages should be sent.
• RCPT TO The recipient's email address. It can be used multiple times in case of multiple recipients.
Since these data fields serve to identify the recipient and the sender, they are often referred to as the Email (or SMTP) Envelope. After a successful negotiation the sending server starts to send the email data, which contain the email headers and the email content.
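Purely as an illustration (this is not part of the described logging pipeline), the envelope fields correspond to the commands a sending server issues during an SMTP session. The sketch below uses Python's standard smtplib client; the host "localhost:1025" and all addresses are hypothetical placeholders for a local test server.

import smtplib

# Minimal sketch: the envelope is built from EHLO/HELO, MAIL FROM and RCPT TO.
# "localhost", 1025 is an assumed local test server; adjust as needed.
client = smtplib.SMTP("localhost", 1025)
client.ehlo("mail.example.com")                     # EHLO string (FQDN of the sending server)
client.mail("[email protected]")           # MAIL FROM - envelope sender / bounce address
client.rcpt("[email protected]")                     # RCPT TO - envelope recipient (repeatable)
client.data(b"From: Example <[email protected]>\r\n"
            b"Subject: Hello\r\n\r\nBody text\r\n") # email data: headers followed by content
client.quit()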
2.2. Email headers
The headers include all other structured information and are part of the DATA stream within the SMTP session. The mandatory fields are:
• From The field should include the email address (and optionally the name) of the author.
• Date The date and time when the email was composed.
Other, optional fields include: Message-ID, In-Reply-To, To, Subject, Bcc, Cc, Content-Type, Precedence, References, Reply-To, Sender and Archived-At [1].
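For illustration only, the mandatory and a few optional header fields can be composed with Python's standard email package; the addresses and names below are made up.

from email.message import EmailMessage
from email.utils import formatdate, make_msgid

msg = EmailMessage()
msg["From"] = "Example Shop <[email protected]>"  # mandatory: author address (optionally with a name)
msg["Date"] = formatdate(localtime=True)           # mandatory: date and time of composition
msg["To"] = "[email protected]"                       # optional fields follow
msg["Subject"] = "Weekly newsletter"
msg["Message-ID"] = make_msgid(domain="shop.example")
msg.set_content("Plain-text body")
print(msg)                                          # prints the serialized headers and content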
Fig. 1. Data structure of an email - the SMTP envelope is sent first, followed by the email data. The IP address of the sending machine is always visible.
2.3. Email content
Although the email system was originally designed only for plain-text messages, nowadays either plain text or HTML is used to represent the email content. Since HTML provides more options for graphical expression and interactivity, it is usually the sender's choice for marketing emails. Usual features found in HTML emails are formatted text, images and outgoing links. Emails can also include multimedia attachments of limited size.
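A typical marketing email therefore carries both a plain-text and an HTML part. The following minimal sketch (again with the standard library, with made-up content and addresses) shows how such a multipart message is built.

from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "[email protected]"
msg["To"] = "[email protected]"
msg["Subject"] = "Spring sale"
msg.set_content("Spring sale - see https://shop.example/sale")            # plain-text fallback
msg.add_alternative(
    "<html><body><h1>Spring sale</h1>"
    "<img src='https://shop.example/banner.png'>"
    "<a href='https://shop.example/sale'>Shop now</a></body></html>",
    subtype="html",
)                                                                          # HTML part: formatted text, image, outgoing link
print(msg["Content-Type"])  # becomes multipart/alternative after the HTML part is added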
3. Antispam techniques
Since spam is a very concerning issue, a lot of work on spam filtering has been done in the last two decades [2, 3].
One of the simplest methods to filter out spam senders is to employ blacklisting techniques. These blacklists can be shared between email service providers (ESPs) or can be created using the sender's reputation statistics, which are built from the reactions of freemail users. Unfortunately, this approach requires that the sender uses constant domains and IP addresses, which is not our case.
Content-based methods use the text content of the message as the main source of information for spam filtering. The most successful methods use simple bag-of-words models with TF-IDF weighting together with standard classifiers such as Naive Bayes, SVM, etc. [4]. The problem with these techniques is that they cannot filter spam emails which use a similar language as ham emails. This is the main reason why they cannot capture unsolicited newsletters.
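For reference, a bag-of-words content filter of the kind mentioned above can be sketched in a few lines with scikit-learn; the tiny training set here is purely illustrative and not taken from the paper's data.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Illustrative toy corpus; a real filter is trained on large labeled collections.
texts = ["cheap pills buy now", "meeting at noon tomorrow",
         "win money fast", "project report attached"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["buy cheap pills"]))  # expected to flag this toy message as spam (1)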
4. Data
Our data set consists of logs which record basic information about each incoming email (as described in Section 2). Seznam.cz provided logs for 5 days. Since our goal is to improve the current system, we used only the data for emails which passed all the existing filters and ended up in user inboxes. Statistics of the delivered messages are summarized in Table 1². We consider an email to be a bulk mail if the system detects similar emails which are delivered to more than 500 distinct users. In this work we refer to a set of similar bulk messages as a bulk group.
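The similarity detection itself is Seznam.cz's internal algorithm and is treated as a black box in this work. Purely to illustrate the grouping step, the sketch below uses a hash of the normalized body as a stand-in similarity key and keeps only groups delivered to more than 500 distinct recipients; the field names are hypothetical.

import hashlib
from collections import defaultdict

BULK_THRESHOLD = 500  # minimum number of distinct recipients for a bulk group

def similarity_key(body: str) -> str:
    """Stand-in for the real similarity detector: hash of the whitespace-normalized body."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def bulk_groups(messages):
    """messages: iterable of dicts with (assumed) 'body' and 'rcpt_to' keys."""
    recipients = defaultdict(set)
    members = defaultdict(list)
    for msg in messages:
        key = similarity_key(msg["body"])
        recipients[key].add(msg["rcpt_to"])
        members[key].append(msg)
    return {k: v for k, v in members.items()
            if len(recipients[k]) > BULK_THRESHOLD}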
Date         Total delivered messages   Delivered bulk messages   Bulk groups
02/08/2014   Not available              17,922,898                1,832
02/09/2014   Not available              17,403,976                1,906
04/08/2014   44,700,490                 25,434,898                2,241
04/14/2014   41,334,968                 21,218,346                2,479
04/15/2014   43,434,874                 23,385,524                2,788

Tab. 1. Data summary - the size of the whole data set is 500 GB.

² There is considerably less bulk mail for the first two days. This difference is caused by the way bulk mails were detected in the older version of the logging system.

Fig. 2. Our unit as a part of the whole filtering system - the unit is used for emails which passed the standard email filtering (here a SpamAssassin unit) and which were sent by an unknown sender.
5. Design
We have already mentioned common methods of spam filtering in Section 3. Our system should work complementarily to other antispam systems (such as content-based filters and reputation databases), and it should process emails which have passed the standard antispam tests and which come from unknown domains and IP addresses, so that we do not have enough reputation statistics for them (see Figure 2).
The lack of domain and IP information can have a few possible causes:
• It is a new legitimate sender.
• A known sender changed its domains and IP addresses.
• It is a sender who is trying to mask himself and hide behind another identity.
In this work we propose features and an algorithm which help to distinguish the last situation from the others. In order to achieve that, we make the reasonable assumption that legitimate senders of bulk messages try to be consistent in their self-presentation, so that their customers always know where to find additional information (i.e. they use similar addresses in the headers, they have functional websites, etc.). On the other hand, spammers who need to mask themselves cannot behave in such a consistent manner, because they would get blocked by reputation systems. In the following paragraphs we describe features which should capture the sender's consistency.
5.1. Features
Since we need features that describe the consistency of the sender's self-presentation, we do not consider each email separately but extract features from the whole group of similar emails which were sent to many users, i.e. from a particular Bulk group³. Focusing on groups of similar emails results in a significant data reduction - instead of analyzing millions of emails directly, we process thousands of Bulk groups.
³ This approach requires a reliable algorithm for detecting similar emails. Seznam.cz has its own similarity detection algorithm, and in this work we presume that it works reliably.
The first three features describe the sending consistency:
• Number of distinct MAIL FROM domains. The envelope field MAIL FROM serves as the address for bounce messages. Senders often want to collect statistics about undelivered emails, so they don't use only one address but rather multiple structured addresses that can look like: [email protected]. Even though the addresses may differ, the second-level domain (here: example.com) can be expected to be the same. This feature counts the number of distinct second-level domains, which are extracted from the domain part of the address (the part following @).
• Number of distinct From addresses. The From header is the address that is displayed to the receiver, and we expect it to be consistent over the whole set of messages. Therefore this feature uses the whole address and not only the second-level domain.
• Number of emails sent per IP. Since spammers need to change their IP addresses, they are often forced to use each of them as much as possible, which can result in an abnormal number of emails sent per one IP address.
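A possible implementation of these three features for one Bulk group is sketched below. The naive second-level-domain extraction and the field names are simplifications of my own; a production system would use a public-suffix list, and the per-IP count is aggregated here as a maximum, which is only one reasonable choice.

from collections import Counter

def second_level_domain(address: str) -> str:
    """Naive SLD extraction from the part after '@' (e.g. 'bounce.example.com' -> 'example.com')."""
    domain = address.rsplit("@", 1)[-1].lower()
    parts = domain.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else domain

def sending_features(messages):
    """messages: list of dicts with (assumed) 'mail_from', 'from_header' and 'ip' keys for one Bulk group."""
    mail_from_domains = {second_level_domain(m["mail_from"]) for m in messages}
    from_addresses = {m["from_header"].lower() for m in messages}
    per_ip = Counter(m["ip"] for m in messages)
    return {
        "distinct_mail_from_domains": len(mail_from_domains),
        "distinct_from_addresses": len(from_addresses),
        "max_mails_per_ip": max(per_ip.values()),
    }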
The next two features focus on properties of the domains used in the message headers - the MAIL FROM address and the From address. Our goal is to describe whether the domains are used for the promotion of the sender. We suppose that well-known and trustworthy internet entities possess comprehensive websites. Such websites are usually well structured, often very large and known by the customers. In order to estimate whether a domain is used for promotion, we focused on two properties - the size of the website (i.e. the number of unique pages within the site) and its PageRank [5]. The size of the domain is estimated with a simple crawler which counts the unique pages accessible from the domain's root page. During the implementation we discovered that the size of the website does not provide enough information, since there are a lot of AJAX-based sites which appear as a single page to our crawler⁴. Therefore we added another property which should describe the relevance and credibility of the site. For this purpose we used Google PageRank, which we obtained through their public API. Having these two properties, we arrived at a simple formula which we use to define an Idle domain. A domain is idle if the following holds:

[WS = 0 ∨ WS = 1] ∧ [PR = 0]     (1)

where WS is the size of the website and PR is its PageRank.
⁴ JavaScript crawling is not a trivial task and it is out of the scope of this work.
Once the Idle domain is defined, we use the result for every message to compute statistics for the whole Bulk group. The two final features are therefore:
• Ratio of idle domains in MAIL FROM headers
• Ratio of idle domains in From headers
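The idle-domain test of Equation (1) and the two ratio features could be sketched as follows. The website size and PageRank are assumed to be looked up beforehand (e.g. by the crawler and the public API mentioned above); treating unknown domains as idle is my own simplifying assumption, not a statement from the paper.

def is_idle(website_size: int, pagerank: int) -> bool:
    """Equation (1): a domain is idle if WS is 0 or 1 and PR is 0."""
    return website_size in (0, 1) and pagerank == 0

def idle_ratio(domains, domain_info):
    """domains: domains used in one header field across the Bulk group.
    domain_info: dict mapping domain -> (website_size, pagerank)."""
    if not domains:
        return 0.0
    # Unknown domains default to (0, 0) and are therefore counted as idle here.
    idle = sum(1 for d in domains if is_idle(*domain_info.get(d, (0, 0))))
    return idle / len(domains)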
For the purpose of a basic analysis we grouped more than 16 million emails from the first two days (the beginning of February 2014, see Section 4) into more than three thousand Bulk groups, computed the features and summarized selected statistics in Table 2.
Feature            Statistics
MailFrom domains   88% of bulk groups use 1 MailFrom domain; 95% use fewer than 3 MailFrom domains.
From addresses     76% of bulk groups use 1 From address; 89% use fewer than 3 From addresses.
Mails per IP       95% of senders send fewer than 14,000 emails per IP.
Idle From          83% of senders do not use idle domains at all; 5% use only idle domains.
Idle MAIL FROM     86% of senders do not use idle domains at all; 11% use only idle domains.

Tab. 2. Selected statistics of individual features.
We can see that the majority of senders behave similarly and as expected - most of them use one From address and one MAIL FROM domain, they send a similar number of messages per IP address, and they mostly use working (non-idle) domains in the address fields. On the other hand, it is worth noticing that a significant number of bulk groups were sent only from idle domains (especially in the case of MAIL FROM).
5.2. Anomaly detection
We already know that there is some standard behavior of senders and that some senders behave unusually, but this by itself does not say anything about the spamminess of the ”unusual” emails. A manual examination of the unusual data shows that nonstandard behavior can be related to the spamminess of emails, but it does not guarantee anything. We don't know in which parts of the feature space the ”ham” and ”spam” emails reside and whether these parts are separable. In order to divide the feature space, linear classifiers are often used (for example SVM). These algorithms need a sufficient number of labeled data for training, but unfortunately such information is not available for our task, because it is difficult for human annotators to classify graymails. There are some emails which can be manually classified with a sufficient degree of certainty, but there are not many of them. Therefore, the only information we can work with is that some senders behave unusually. We try to identify such senders and measure on a small subset of labeled emails whether such identification helps in spam recognition. The task of finding patterns in data that do not conform to the expected behavior is called Anomaly detection [7], and it is often part of fraud, intrusion or fault detection systems. A lot of detection techniques have therefore already been developed, and an extensive analysis of such algorithms and their usage can be found in [7]. In this work we have used statistical approaches, which model the Probability density function (PDF) of unlabeled data and rely on the assumption that normal data instances occur in high-probability regions of the feature space, while anomalies occur in the low-probability regions. The normality of each instance is expressed by its value of the PDF - in our work we denote it as the Normal behavior score. We have examined four models - univariate and multivariate normal distributions, simple histograms and a Parzen windows model (with a Gaussian kernel). It is worth mentioning that some of these methods model the probability density function of each feature separately and assume the features to be independent (see Table 3).
Model                   Independence assumption   Parameters to estimate
Univariate Gaussian     YES                       µ, σ (using MLE)
Multivariate Gaussian   NO                        µ, Σ (using MLE)
Histograms              YES                       Bin width
Parzen windows          YES                       Gaussian σ

Tab. 3. Anomaly detection methods. Gaussian means and variances are estimated using the Maximum Likelihood Estimate. The bin width and the kernel variance are estimated in the validation phase.
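As an illustration of this statistical approach, the Parzen windows model with a Gaussian kernel can be approximated by a kernel density estimate: the density fitted on unlabeled Bulk groups is evaluated on new groups to obtain their Normal behavior score, and low-score groups are flagged as anomalies. The sketch below uses scipy; the feature matrices and the threshold are placeholders, not values from the paper.

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 5))  # placeholder: 1000 Bulk groups x 5 features
new = rng.normal(size=(20, 5))      # placeholder: new Bulk groups to score

# Fit the density on unlabeled data; gaussian_kde expects shape (d, n).
kde = gaussian_kde(train.T)         # the bandwidth plays the role of the Gaussian kernel sigma

scores = kde(new.T)                 # Normal behavior score = estimated density value
threshold = 1e-3                    # placeholder; in the paper the threshold is chosen on validation data
anomalies = scores < threshold      # low-density Bulk groups are flagged as anomalous
print(anomalies)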
6. Experiments
In this part we describe our experiments with anomalous behavior detection and its possible influence on spam filtering. We split the data into three parts for training, validation and test purposes (as depicted in Table 4). In order to simulate a real-life situation, we used the first two days of our data set for training purposes. Notice that for this phase we do not need any labeled data. On the other hand, we want to measure the correlation between abnormal behavior and spamminess, so we need some labeled data in the subsequent phases. For the labeling purposes we used feedback from the users (we analyzed how they used the mark-as-spam button) as a hint for manual labeling, and we managed to label more than 250 emails⁵. We then used the data from the first two days in April for the validation of our models. In the validation phase we estimated the Normal behavior score thresholds and also the parameters of the histogram and Parzen windows models, i.e. the bin width and the variance of the Gaussian kernel (a more detailed description can be found in [6]).
The last day was used only for the comparison of the particular models - we examined how the anomalies detected by the models correlate with spam. In Figure 3 we plot the ROC curves of our models on the validation data set. Since every model creates a different probability density function (PDF), the maximum possible Normal behavior score is different for each model. Therefore we cannot use the same thresholds for all models; instead, we use 1000 linearly spaced values between 0 and MaxScore_i, where MaxScore_i is the maximum possible Normal behavior score for model i.
Usage            Dates        Labeled Bulk groups (Ham/Spam)
Model training   02/08/2014   –/–
Model training   02/09/2014   –/–
Validation       04/08/2014   51/34
Validation       04/14/2014   57/35
Testing          04/15/2014   48/34

Tab. 4. Data usage overview.
⁵ Since graymail labeling is difficult, we marked a message as spam only if it had an extremely bad user reputation. Similarly, we marked an email as ham only if it had a very good reputation and the sender was a known and reliable internet entity.
Fig. 3. ROC curves for all models
We can see that for the lower thresholds all classifiers perform quite well - all of them are able to recognize more than 40% of spam with less than a 6% false positive rate. The ROC curves grow steeply at the beginning, but at a certain point (approximately 0.08 FPR) the curves flatten and proceed linearly to the endpoint (where all Bulk groups are classified as anomalies). We can also see many points at the beginning of the curves, almost no points in the center and some points at the end. This suggests that Bulk groups usually receive either low scores or high scores. Moreover, every curve should be formed by 1000 points, but far fewer points are visible. This suggests that many Bulk groups receive similar scores.
In order to validate this hypothesis, we plotted the histogram of scores for our labeled validation data in Figure 4⁶. We can see two peaks in the histogram - the higher peak (more than 75%) represents the Bulk groups that received the highest scores (these senders behave normally), and the second peak (20%) represents the Bulk groups with the lowest scores (anomalous senders). The vast majority of the anomaly peak consists of spam, and that is the reason why the ROC curves grow steeply at the beginning. On the other hand, we can see that approximately the same amount of spam was sent by non-anomalous senders. These are the senders which we will never detect with anomaly detection algorithms based on these features.
⁶ We have chosen the Histogram model because the behavior is very apparent from its scores, but the other models behave similarly.
Fig. 4. Histogram of labeled Bulk groups based on scores from the Histogram model - emails receive either a very low or a very high Normal behavior score.
Using only 177 manually labeled Bulk groups (where only evident spam or ham was labeled) could introduce some bias into the problem, but we also examined the histogram of scores for the unlabeled data set (4,700 Bulk groups) and the distribution is still the same, which suggests that this bias has no effect on the distribution of scores.
The distribution of scores shows that the performance for the lower thresholds is crucial for our task. The Parzen windows model seems to perform best on the validation data. The same result was confirmed on the test set, and therefore we list more detailed results for this model in Table 5. The performance on the test set confirms the results from the previous sections. It suggests that we can detect about 50% of spam while keeping the false positive rate below 5%. On the other hand, it must be noted that these results are very tentative, since the performance was measured on a very small subset of labeled data.
Threshold   TPR/FPR     F0.5 score   Anomaly Bulk groups
2.18e-08    0.32/0.04   0.64         270
3.49e-07    0.53/0.04   0.79         516
2.71e-06    0.76/0.13   0.80         1174

Tab. 5. Performance of the Parzen windows model on the test data set.
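For completeness, the F0.5 values in Table 5 can be checked from the TPR/FPR values and the test-set class counts of Table 4 (48 ham and 34 spam Bulk groups); the small sketch below is only a sanity check, not part of the original evaluation code.

def f_beta_from_rates(tpr, fpr, n_spam=34, n_ham=48, beta=0.5):
    """F-beta computed from TPR/FPR and class counts (test-set sizes taken from Table 4)."""
    tp = tpr * n_spam
    fp = fpr * n_ham
    precision = tp / (tp + fp)
    recall = tpr
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

for tpr, fpr in [(0.32, 0.04), (0.53, 0.04), (0.76, 0.13)]:
    print(round(f_beta_from_rates(tpr, fpr), 2))  # approx. 0.64, 0.79, 0.80, matching Table 5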
In the last phase we took the 516 Bulk groups which received a Normal behavior score lower than 3.49e-7 and manually analyzed their content. We focused on the topic of the email and on the sending entity (the sending entities were intentionally estimated from the content of the email, not from the headers). The topic distribution of the anomalous emails can be found in a pie chart in Figure 5. As we can see, almost half of the anomalies advertised some goods or services available online. How many of these advertisements are unsolicited newsletters and how many are legal graymail is still an open question, but we can say that the behavior of these senders is suspicious. There are also other interesting categories of emails, which promote online casinos, dating sites, earning opportunities, loans, pharmacy products, insurance, or which have sexual content. Together these categories make up 37% of the anomalies and are very likely unwanted bulk mail. We have also discovered that for 11% of the emails we cannot determine the name of the sending entity, and such self-presentation is very suspicious.

Fig. 5. Topic distribution of anomalous emails; for 11% of the emails we could not determine the name of the sending entity.
7. Conclusions
In this work we proposed a system that uses anomaly detection and specifically designed features in order to detect the masking behavior of spammers. The proposed features, which should describe the sender's self-presentation habits, were extracted from a large set of unlabeled data (tens of millions of emails) and were used to train statistical models of normal behavior. In the testing phase the new instances are compared against the normal behavior model and receive a Normal behavior score. If the score is lower than a predefined threshold, we consider the email suspicious and it will be shown to a human operator.
Preliminary results on a small set of 259 labeled emails suggest that the system should be able to detect approximately 50% of spam while keeping the false positive rate under 5%. Even though such results are very promising, we take them as very tentative and we are planning more tests on larger data sets.
We have also analyzed the distribution of the Normal behavior score for labeled emails, and it suggests that approximately half of the spammers do not behave abnormally and therefore we won't be able to detect them with our system. On the other hand, there are also ham emails which look unusual and which would cause false positive errors. However, this system is not intended as an automatic filter but as a tool which should help human operators. Therefore, we hope that it can still give them a new and useful perspective on suspicious emails.
Acknowledgements
The research described in this paper was supervised by Ing. J. Šedivý, FEE CTU in Prague, and technically supported by Seznam.cz.
References
[1] RESNICK, P. RFC 5322: Internet Message Format. Technical report, The Internet Engineering Task Force, 2008.
[2] BLANZIERI, E., BRYL, A. A survey of learning-based techniques of email spam filtering. Artificial Intelligence Review, 2008, vol. 29, no. 1, p. 63-92.
[3] GOODMAN, J., CORMACK, G. V., HECKERMAN, D. Spam and the ongoing battle for the inbox. Communications of the ACM, 2007, vol. 50, no. 2, p. 24-33.
[4] DRUCKER, H., WU, S., VAPNIK, V. N. Support vector machines for spam categorization. IEEE Transactions on Neural Networks, 1999, vol. 10, no. 5, p. 1048-1054.
[5] PAGE, L., BRIN, S., MOTWANI, R., WINOGRAD, T. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.
[6] GOGÁR, T. Illegal newsletter detection unit. Master's thesis, Czech Technical University in Prague, 2014.
[7] CHANDOLA, V., BANERJEE, A., KUMAR, V. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 2009, vol. 41, no. 3.
About Authors...
Tomáš GOGÁR was born in Jablonec nad Nisou, Czech Republic. He received his Master's degree in Artificial Intelligence from the Czech Technical University in Prague. He now continues his studies as a PhD student focusing on NLP, mainly on information extraction for specific domains.