
Statistical Anomaly Detection on Real e-Mail Traffic
Maurizio Aiello 1, Davide Chiarella 1,2, and Gianluca Papaleo 1,2
1 National Research Council, IEIIT, Genoa, Italy
2 University of Genoa, Department of Computer and Information Sciences, Italy
{maurizio.aiello,davide.chiarella,gianluca.papaleo}@ieiit.cnr.it
Abstract. Many recent studies and proposals address Anomaly Detection Techniques, especially for worm and virus detection. In this field it is important to answer a few questions, such as at which ISO/OSI layer the data analysis is performed and which approach is used. Furthermore, these works suffer from a scarcity of real data, due to lack of network resources or privacy
problems: almost every work in this sector uses synthetic (e.g. DARPA) or pre-made data sets.
Our study is based on layer-seven quantities (the number of e-mails sent in a chosen period): we
quantitatively analyzed our network e-mail traffic (4 SMTP servers, 10 class C networks) and
applied our method to the gathered data to detect indirect worm infections (worms which use e-mail
to spread). The method is a threshold method and, in our dataset, it identified various
worm activities. In this document we present our data analysis and results in order to stimulate
new approaches and debates in Anomaly Intrusion Detection Techniques.
Keywords: Anomaly Detection Techniques; indirect worm; real e-mail traffic.
1 Introduction
Network security and Intrusion Detection Systems have become a research
focus with the ever faster development of the Internet and the growth of unauthorized
activities on the Net. Intrusion Detection Techniques are an important security barrier
against computer intrusions, virus infections, spam and phishing. In the known literature there are two main approaches to worm detection [1], [2]: misuse intrusion detection and anomaly intrusion detection. The first is based on the signature
concept: it is more accurate, but it cannot identify intrusions that do not fit a pre-defined signature, and is therefore not adaptive [3], [4]. The second
tries to build a model that characterizes normal behaviour: the system defines the
expected network behaviour and raises an alarm if short-term usage deviates
significantly from the profile. It is a more adaptive system, ready to counter new threats, but it has a high rate of false positives [5], [6], [7], [8], [9]. In theory, misuse and anomaly detection integrated together can give a holistic
estimation of malicious situations on a network.
Which kinds of threats spread via e-mail? The main ones
are worms and viruses, spam and phishing. Let us summarize the situation.
At present the Internet has more than one billion users [10], and more and more
cyber-criminal activities originate on and misuse this worldwide network using different tools: one of the most important and relevant is electronic mail. In fact
E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 170–177, 2009.
springerlink.com
© Springer-Verlag Berlin Heidelberg 2009
nowadays Internet users are flooded by a huge amount of e-mails infected with worms
and viruses: indeed, the majority of mass-mailing worms employ a Simple Mail Transfer Protocol engine as their infection delivery mechanism, so in recent years a multitude
of worm epidemics has affected millions of networked computing devices [11]. Are
worms a real and growing threat? The answer is simple and can be found in the virulent events of recent years: thousands of hosts infected and billions of dollars in damage [12]. Moreover, we have recently witnessed a merging of worm and spam
activities: it has been estimated that 80% of spam is sent by spam zombies [13], which
suggests that worse may be yet to come. How can we neutralize all these menaces? Many lines of research on Intrusion Detection Techniques have
been developed over the years. Proposed Network Intrusion Detection Systems have worked,
and still work, on information available at different TCP stack layers: Datalink (e.g.
monitoring ARP [14]), Network (e.g. monitoring IP, BGP and ICMP [15], [16]),
Transport (e.g. monitoring DNS MX queries [17], [18]) and Application (e.g. monitoring
system calls [19], [20]). Sometimes, because of the large number of features available
at the different levels, researchers correlate information gathered at each level in order to
improve the effectiveness of the detection system. In a similar way, we propose to
work with e-mail, focusing our attention on quantities, considering that all three
of the above phenomena have something in common: they all use SMTP [21] as their proliferation medium.
In this paper our goal is to present an analysis of a dataset that reflects the complete
SMTP traffic sent by seven /24 networks, in order to detect worm and virus infections,
given that no user in our network is a spammer. Moreover, we want to stress that the
dataset we worked on is genuine and not synthetic (unlike KDD 99 [22] and
DARPA [23]), so we hope that it may inspire other researchers and that
in the near future our work may produce a genuine data set at everyone's
disposal.
The paper is structured as follows. Section 2 introduces the scenario of the analysis.
Section 3 discusses the dataset we worked on. The theory and methods of our analysis are
described in Section 4, and our experimental results, obtained using our tools to analyze mail activities, are discussed in Section 5. In Section 6 we give our conclusions.
2 Scenario
Our approach is highly experimental. We work on eleven local area networks (class C)
interconnected by a layer-three switch and directly connected to the Internet (no
NAT [24] policies, all public IPs). In this network we have five different mail servers,
ranging from Postfix to Sendmail. As every system administrator knows, every server
has its own kind of log, and the main problem with log files is that every transaction is
interleaved with the others, which makes them difficult for a human being to read: for
this reason we focused our anomaly detection on a single Postfix mail server,
optimizing our efforts. Every mail is checked by an antivirus server (Sophos). To counter spam we use SpamAssassin [25] and Greylisting [26]. SpamAssassin is a
piece of software which uses a variety of mechanisms, including header and text analysis,
Bayesian filtering, DNS blocklists, and collaborative filtering databases, to detect
spam. A mail transfer agent which uses greylisting temporarily rejects e-mail from
unknown senders. If the mail is legitimate, the originating server will try to send
it again later, as the RFC requires, while mail from a spammer will probably not
be retried, because very few spammers are RFC compliant. The few spam sources
which do re-transmit later are more likely to be listed in DNSBLs [27] and distributed
signature systems. Greylisting and SpamAssassin heavily reduced our spam percentage. To complete the description, we must add that port 25 is monitored and filtered: the hosts inside our network cannot communicate with a host outside our
network on port 25, and an outsider cannot communicate with one of our hosts on port 25.
These restrictions nullify two threats: the first concerns infected hosts which
could become spam-zombie PCs; the second concerns the SMTP relaying misuse
problem. Indeed, since we are a research institution, almost all the hosts are used by a
single person who holds root privileges and could therefore install an SMTP
server. Only a few of the hosts are shared among different people (students, fellow
researchers, etc.). We have a good balance between Linux distributions and Windows operating systems. We focus our attention on one mail server, which runs Postfix. Every mail is checked by the antivirus server, which is updated
once an hour: this is an important fact because it ensures that all the worms found during the analysis are zero-day worms [28-30].
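The greylisting behaviour described in this section can be illustrated with a minimal sketch. It rests on assumptions that the paper does not state: the class name Greylist, the lookup key (client IP, sender, recipient), the 300-second minimum retry delay and the use of reply codes 450 and 250 are hypothetical choices made for illustration, not the configuration of the server studied here.

```python
class Greylist:
    """Toy greylisting policy: defer the first attempt, accept a later retry."""

    def __init__(self, min_delay=300):
        self.min_delay = min_delay   # assumed minimum wait (seconds) before a retry is accepted
        self.first_seen = {}         # (ip, sender, recipient) -> time of first attempt

    def check(self, client_ip, sender, recipient, now):
        """Return an SMTP-style reply code for a delivery attempt at time `now`."""
        key = (client_ip, sender.lower(), recipient.lower())
        # Remember the first attempt; later attempts reuse the stored time.
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.min_delay:
            return 250   # sender retried after the delay: accept
        return 450       # unknown or too-early sender: temporary rejection


# Hypothetical addresses, used only for the example.
greylist = Greylist()
first_try = greylist.check("192.0.2.7", "a@example.org", "b@example.net", now=0)
early_retry = greylist.check("192.0.2.7", "a@example.org", "b@example.net", now=60)
late_retry = greylist.check("192.0.2.7", "a@example.org", "b@example.net", now=900)
```

A compliant server that retries after the delay is accepted, while a sender that never retries is never delivered, which is the effect the text relies on.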
This server provides service to 300 users with a wide international social network,
due to the research mission of our Institution [31]: this gives us a huge amount of
SMTP traffic.
3 Dataset
We analyze the mail-server log over a period of 1065 days (2 years and 11 months). To
speed up the process we used LMA (Log Mail Analyzer [32]) to make the log more
readable. LMA is an open-source Perl program, available on SourceForge, which
makes Postfix and Sendmail logs human readable. It reconstructs every single e-mail
transaction spread across the mail-server log and creates a plain text file in a simpler
format. Every row represents a single transaction and has the following fields:
• Timestamp: the moment at which the e-mail was sent; this information is
available either as a Unix timestamp or as a standard date in Julian
format.
• Client: the hostname of the e-mail sender (HELO identifier).
• Client IP: the IP of the sender's host.
• From: the e-mail address of the sender.
• To: the e-mail address of the receiver.
• Status: the server response (e.g. 450, 550, etc.).
With this format it is possible to find the moment at which the e-mail was sent, the
sender's client name and IP, the From and To fields of the e-mail and the server response.
Let us give an example: if [email protected] sends an e-mail on 23 March 2006 to [email protected] from X.X.2.235 and all the e-mail server transactions succeed,
we will have a record like this:
23/03/2006 X.X.2.235 [email protected] [email protected] 250
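As an illustration, a record of this kind could be parsed as sketched below. The sketch assumes the five whitespace-separated fields visible in the example record (date, client IP, From, To, status); real LMA output may also carry the client hostname or use another timestamp format. The names MailRecord and parse_record and the two e-mail addresses are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class MailRecord:
    timestamp: datetime
    client_ip: str
    sender: str
    recipient: str
    status: int


def parse_record(line: str) -> MailRecord:
    """Parse one record: date, client IP, From, To, status (assumed layout)."""
    date_s, client_ip, sender, recipient, status = line.split()
    return MailRecord(
        timestamp=datetime.strptime(date_s, "%d/%m/%Y"),
        client_ip=client_ip,
        sender=sender,
        recipient=recipient,
        status=int(status),
    )


# Hypothetical addresses stand in for the ones hidden in the example record.
record = parse_record("23/03/2006 X.X.2.235 alice@example.org bob@example.net 250")
```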
As already said, we want to stress that our data are not synthetic and so do not suffer
from bias of any form: they reflect the complete set of e-mails received by a single high-traffic e-mail server and represent the overall view of a typical network operator.
Furthermore, unlike synthetic data, they are affected by accidental hardware faults: can
one really say that a network topology is static and not prone to wanted and unwanted
changes? Intrusion Detection evaluation datasets rest on some hypotheses, among them
a never-changing topology and immortal hardware health. This is
not reality: Murphy's Law holds true and strikes with extraordinary efficiency. In addition, our data concern only the SMTP flow and, thanks to the long-term
monitoring, are a good snapshot of everyday life and, romantically, a silent witness of
the growth of the Internet and of e-mail use.
4 Analysis
Our analysis has been carried out on the e-mail traffic of ten class C networks over a period of
900 days, from January 2004 to November 2006. In our analysis we work on the
global e-mail flow in a given time interval. We use threshold detection [33], as
other software does (e.g. Snort): if the traffic volume rises above a given threshold, the
system triggers an alarm. The threshold is calculated statistically, by
determining the normal e-mail traffic of the network in selected slices of time: for example,
we take the activity of a month, divide the month into five-minute slices and calculate how many e-mails are normally sent in five minutes. After that, we check that
the number of e-mails sent during each interval of a day does not exceed the threshold.
We call this kind of analysis base-line analysis. Our strategy is to study the temporal correlation between the present behaviour of a given entity (PC, entire network),
possibly modified by worm activity, and its past behaviour (normal
activity, no virus or worm presence). Before proceeding, however, we pre-process the
data by subtracting the mean from the values and discarding all the intervals with a resulting negative
number of e-mails, because we wanted to filter out the periods of no or little activity,
which are not interesting for our purposes.
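The slicing and pre-processing steps can be sketched as follows. This is an illustration under stated assumptions, not the tool actually used: timestamps are taken to be Unix seconds, empty slices between the first and last e-mail are counted as zeros, and slices exactly equal to the mean are kept. The function names count_per_slice and keep_active_slices are hypothetical.

```python
from collections import Counter
from statistics import mean


def count_per_slice(timestamps, slice_seconds=300):
    """Count e-mails per fixed-width time slice (default: five minutes)."""
    counts = Counter(int(ts) // slice_seconds for ts in timestamps)
    first, last = min(counts), max(counts)
    # Include empty slices so that idle periods contribute zeros to the mean.
    return [counts.get(i, 0) for i in range(first, last + 1)]


def keep_active_slices(counts):
    """Subtract the mean and drop the slices whose result is negative."""
    avg = mean(counts)
    return [c for c in counts if c - avg >= 0]


# Eight e-mails spread over four five-minute slices (made-up timestamps).
sample_counts = count_per_slice([0, 10, 20, 310, 900, 905, 910, 915])
active_counts = keep_active_slices(sample_counts)
```

In this toy input the quiet slices fall below the mean and are discarded, so later statistics describe only the busy periods.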
In other words, we discarded all the time slices in which the number of e-mails sent was
below the month average, with the purpose of dynamically selecting activity periods
(working hours, no holidays, etc.). Without this pre-processing, the average would
have depended on the duration of nights, weekends and holidays.
In 2004 the mean number of e-mails sent, before pre-processing, was 524 per day over 339 activity days; after pre-processing it was 773 per day over 179 activity days. After this we
calculate the baseline activity of working hours as μ + 3σ, where μ is the mean and σ the standard deviation of the number of e-mails per slice.
The mean and the variance are calculated for every month, modelling the network
behaviour, taking into account every chosen time interval (e.g. we divide February into
five-minute slices, count how many e-mails are sent in each slice and then
calculate the mean over these intervals). The values were compared with the baseline
threshold and marked if they exceeded it. Analyzing the first five
months with a five-minute slice, we found too many alerts, many of which exceeded
the threshold by only a few e-mails. So we decided to correlate the alerts found with a
five-minute period with those found with a one-hour period, on the hypothesis that a
worm which has infected a host sends a lot of e-mails both in a short period and in a bit
longer period. To clarify the concept, let us take the analysis for one month: April 2004
(see Fig. 1 and Fig. 2). The five-minute base-line was 63 e-mails, while the
one-hour base-line was 463. In the five-minute analysis we found sixteen alerts, whereas in
the one-hour analysis we found only three. Why is there such a big gap between the two approaches? In the five-minute analysis we have a lot of false alarms, due to
e-mails sent to very large mailing lists, while in the one-hour analysis we find very few
alarms, but these alarms are more significant because they represent a continued
violation of the normal (expected) activity.
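A minimal sketch of the μ + 3σ base-line and of the alert check follows. It assumes the population standard deviation (statistics.pstdev) and, for brevity, computes the threshold on the same list it then checks, whereas the analysis described here derives it per month from pre-processed slices. The names baseline_threshold and find_alerts and the sample counts are hypothetical.

```python
from statistics import mean, pstdev


def baseline_threshold(counts, k=3.0):
    """Base-line threshold: mean plus k standard deviations (k = 3 in the text)."""
    return mean(counts) + k * pstdev(counts)


def find_alerts(counts, threshold):
    """Return the indices of the slices whose count exceeds the threshold."""
    return [i for i, c in enumerate(counts) if c > threshold]


# Made-up per-slice counts: steady traffic followed by one burst.
counts = [10, 12, 11, 9, 10, 12, 11, 9, 10, 11, 10, 12, 11, 9, 10, 95]
threshold = baseline_threshold(counts)
alert_slices = find_alerts(counts, threshold)
```

Running the same two functions on five-minute and on one-hour counts gives the two alert sets that are compared in the text.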
Correlating these results, i.e. searching for the selected five-minute periods within the
one-hour alerts, we found that a small set of the five-minute alarms were close together
in time: after a deeper analysis, using our knowledge and experience of real
users' activity, we concluded that they were due to worm activity.
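The correlation step can be sketched as a simple containment test. The sketch assumes that five-minute slices and one-hour slices are numbered from the same origin, so that hour h contains slices 12h to 12h + 11; the function name correlate_alerts and the sample indices are hypothetical, and the final judgement still requires manual inspection as described above.

```python
def correlate_alerts(five_min_alerts, hour_alerts, slices_per_hour=12):
    """Keep the five-minute alert slices that fall inside an alerted hour."""
    alerted_hours = set(hour_alerts)
    return [s for s in five_min_alerts if s // slices_per_hour in alerted_hours]


# Made-up indices: hour 3 covers five-minute slices 36..47.
confirmed = correlate_alerts([3, 40, 41, 43, 100], [3])
```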
Fig. 1. Example of e-mail traffic: hour base-line
Fig. 2. Example of e-mail traffic: five minutes base-line
4.1 SMTP Sender Analysis
Sometimes the peaks caught by the flow analysis were e-mails sent to mailing lists, which are,
as already said, bothersome false alarms. This fact led to the from analysis, in which we analyze how many different e-mail addresses every host uses: we look at which From field is
used by every host. A host owned by a single person or a few persons is
unlikely to use many different e-mail addresses in a short time, and if it does so,
this is to be considered highly suspicious behaviour. So we think that this analysis could be
used to identify true positives, or to flag suspect activity. Of course, it is not
certain that a worm will change the From field continuously, but it is a likely event.
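The sender analysis can be sketched by counting the distinct From addresses that each host uses in each time window. The window length (300 seconds) and the limit of three addresses are hypothetical parameters chosen for illustration, since no numeric limit is given here; the function names and sample records are likewise assumptions.

```python
from collections import defaultdict


def distinct_senders(records, window_seconds=300):
    """Map (client IP, window index) to the number of distinct From addresses.

    `records` is an iterable of (unix_timestamp, client_ip, from_address).
    """
    seen = defaultdict(set)
    for ts, ip, sender in records:
        seen[(ip, int(ts) // window_seconds)].add(sender.lower())
    return {key: len(addresses) for key, addresses in seen.items()}


def suspicious_hosts(records, max_senders=3, window_seconds=300):
    """Return hosts that used more than `max_senders` addresses in one window."""
    counts = distinct_senders(records, window_seconds)
    return sorted({ip for (ip, _), n in counts.items() if n > max_senders})


# Made-up records: one host cycles through four addresses in five minutes.
records = [
    (0, "192.0.2.10", "a@example.org"),
    (10, "192.0.2.10", "b@example.org"),
    (20, "192.0.2.10", "c@example.org"),
    (30, "192.0.2.10", "d@example.org"),
    (40, "192.0.2.20", "user@example.org"),
    (50, "192.0.2.20", "user@example.org"),
]
flagged = suspicious_hosts(records)
```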
4.2 SMTP Reject Analysis
One typical feature of malware is haste in spreading the infection. This haste leads
indirect worms to send a lot of e-mails to unknown receivers or nonexistent e-mail addresses: we think this is a very important mistake. In fact, all e-mails sent to a
nonexistent e-mail address are rejected by the mail server and tracked in the
log. In this step of our work we analyze the flow of rejected e-mails: we work only on e-mails referred by the Internet server. By this approach we identified worm activity.
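A minimal sketch of the reject analysis counts the rejected transactions per sending host. It assumes that a rejection is any record with a 5xx status (such as the 550 mentioned in the dataset description) and uses a hypothetical limit of two rejections; neither the status range nor the limit is specified here, and the function names are illustrative.

```python
from collections import Counter


def rejects_per_host(records):
    """Count permanently rejected e-mails (assumed: status 5xx) per client IP.

    `records` is an iterable of (client_ip, status).
    """
    return Counter(ip for ip, status in records if 500 <= status <= 599)


def noisy_hosts(records, max_rejects=2):
    """Return hosts with more than `max_rejects` rejected e-mails."""
    counts = rejects_per_host(records)
    return sorted(ip for ip, n in counts.items() if n > max_rejects)


# Made-up records: one host keeps writing to nonexistent addresses.
records = [
    ("192.0.2.10", 550),
    ("192.0.2.10", 550),
    ("192.0.2.10", 550),
    ("192.0.2.10", 250),
    ("192.0.2.20", 550),
    ("192.0.2.20", 250),
]
rejecting = noisy_hosts(records)
```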
Table 1. Experimental results

Date                      Infected Host                                 Analysis
28/01/04 18:00            X.X.6.24                                      Baseline, from, reject
29/01/04 10:30            X.X.4.24                                      From, reject
28/04/2004 14:58-15:03    X.X.7.20                                      Baseline, from, reject
28/04/2004 15:53-15:58    X.X.5.216                                     Baseline, from, reject
29/04/2004 09:08-10:03    X.X.6.36, X.X.7.20                            Baseline, from, reject
04/05/2004 12:05-12:10    X.X.5.158                                     Baseline, from, reject
04/05/2004 13:15-13:25    X.X.5.158                                     Baseline, from, reject
31/08/04 14:51            X.X.3.234                                     Baseline, reject
31/08/04 17:46            X.X.3.234                                     Baseline, reject
23/11/04 11:38            X.X.3.101, X.X.3.200, X.X.3.234, X.X.5.123    Baseline, reject
22/08/05 17:13            X.X.10.10                                     Baseline, from, reject
22/08/05 17:43            X.X.10.10                                     Baseline, from, reject
22/08/05 20:18            X.X.10.10                                     Baseline, from, reject
22/08/05 22:08-22:13      X.X.10.10                                     Baseline, from, reject
5 Results
The approach detected 14 worm activities, mostly concentrated in 2004. We think
that this is due to new firewall policies introduced in 2005 and to the introduction of a second antivirus engine in our Mail Transfer Agent. Moreover, in recent
years we have not had very large worm infections. The results we obtained are summarized in Table 1.
6 Conclusion
Baseline analysis can be useful in identifying some indirect worm activity, but this
approach needs to be integrated with other methods, because it lacks a complete
view of SMTP activity: this gap can be filled by methods which analyze other
aspects of SMTP, such as the From and To e-mail fields. In the future this method can be integrated
into an anomaly detection system to detect anomalies more accurately.
Acknowledgments. This work was supported by the National Research Council of Italy
and the University of Genoa.
References
1. Axelsson, S.: Intrusion detection systems: A survey and taxonomy. Tech. Rep. 99-15,
Chalmers Univ. (March 2000)
2. Verwoerd, T., Hunt, R.: Intrusion detection techniques and approaches. Comput. Commun. 25(15), 1356–1365 (2002)
3. Ilgun, K., Kemmerer, R.A., Porras, P.A.: State transition analysis: A rule-based intrusion
detection approach. IEEE Transactions on Software Engineering 21(3), 181–199 (1995)
4. Kumar, S., Spafford, E.H.: A software architecture to support misuse intrusion detection.
In: Proceedings of the 18th National Information Security Conference, pp. 194–204 (1995)
5. Denning, D.E.: An intrusion detection model. IEEE Transactions on Software Engineering
(1987)
6. Estévez-Tapiador, J.M., Garcia-Teodoro, P., Diaz-Verdejo, J.E.: Anomaly detection methods in wired networks: A survey and taxonomy. Comput. Commun. 27(16), 1569–1584
(2004)
7. Du, Y., Wang, W.-q., Pang, Y.-G.: An intrusion detection method using average hamming
distance. In: Proceedings of the Third International Conference on Machine Learning and
Cybernetics, Shanghai, 26-29 August (2004)
8. Anderson, D., Frivold, T., Valdes, A.: Next-generation intrusion detection expert system
(NIDES). Computer Science Laboratory (SRI International, Menlo Park, CA): Technical
report SRI-CSL-95-07 (1995)
9. Wang, Y., Abdel-Wahab, H.: A Multilayer Approach of Anomaly Detection for Email
Systems. In: Proceedings of the 11th IEEE Symposium on Computers and Communications (ISCC 2006) (2006)
10. http://www.internetworldstats.com/stats.htm
11. http://en.wikipedia.org/wiki/Notable_computer_viruses_and_worms
12. Moore, D., Paxson, V., Savage, S., Shannon, C., Staniford, S., Weaver, N.: Inside the
slammer worm. IEEE Magazine of Security and Privacy, 33–39 (July/August 2003)
13. Leyden, J.: Zombie PCs spew out 80% of spam. The Register (June 2004)
14. Yasami, Y., Farahmand, M., Zargari, V.: An ARP-based Anomaly Detection Algorithm
Using Hidden Markov Model in Enterprise Networks. In: Second International Conference
on Systems and Networks Communications (ICSNC 2007) (2007)
15. Berk, V., Bakos, G., Morris, R.: Designing a Framework for Active Worm Detection on
Global Networks. In: Proceedings of the first IEEE International Workshop on Information Assurance (IWIA 2003), Darmstadt, Germany (March 2003)
16. Bakos, G., Berk, V.: Early detection of internet worm activity by metering icmp destination unreachable messages. In: Proceedings of the SPIE Aerosense 2002 (2002)
17. Whyte, D., Kranakis, E., van Oorschot, P.C.: DNS-based Detection of Scanning Worms in
an Enterprise Network. In: Proceedings of the 12th Annual Network and Distributed System Security Symposium, San Diego, USA, February 3-4 (2005)
18. Whyte, D., van Oorschot, P.C., Kranakis, E.: Addressing Malicious SMTP-based Mass-Mailing Activity Within an Enterprise Network
19. Hofmeyr, S.A., Forrest, S., Somayaji, A.: Intrusion Detection using Sequences of System
Calls. Journal of Computer Security 6(3), 151–180 (1998)
20. Cha, B.: Host anomaly detection performance analysis based on system call of Neuro-Fuzzy using Soundex algorithm and N-gram technique. In: Proceedings of the 2005 Systems Communications (ICW 2005) (2005)
21. http://www.ietf.org/rfc/rfc0821.txt
22. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
23. http://www.ll.mit.edu/IST/ideval/data/data_index.html
24. http://en.wikipedia.org/wiki/Network_address_translation
25. http://spamassassin.apache.org/
26. Harris, E.: The Next Step in the Spam Control War: Greylisting
27. http://en.wikipedia.org/wiki/DNSBL
28. Crandall, J.R., Su, Z., Wu, S.F., Chong, F.T.: On Deriving Unknown Vulnerabilities from
Zero-Day Polymorphic and Metamorphic Worm Exploits. In: CCS 2005, Alexandria, Virginia, USA, November 7–11 (2005)
29. Portokalidis, G., Bos, H.: SweetBait: Zero-Hour Worm Detection and Containment Using
Honeypots
30. Akritidis, P., Anagnostakis, K., Markatos, E.P.: Efficient Content-Based Detection of
Zero-Day Worms
31. http://www.cnr.it/sitocnr/home.html
32. http://lma.sourceforge.net/
33. Geer, D.: Behaviour-Based Network Security Goes Mainstream. Computer (March
2006)