Statistical Anomaly Detection on Real e-Mail Traffic

Maurizio Aiello1, Davide Chiarella1,2, and Gianluca Papaleo1,2

1 National Research Council, IEIIT, Genoa, Italy
2 University of Genoa, Department of Computer and Information Sciences, Italy
{maurizio.aiello,davide.chiarella,gianluca.papaleo}@ieiit.cnr.it

Abstract. There have been many recent studies and proposals on Anomaly Detection Techniques, especially for worm and virus detection. In this field it matters to answer a few important questions, such as at which ISO/OSI layer the data analysis is done and which approach is used. Furthermore, these works suffer from a scarcity of real data, due to lack of network resources or to privacy problems: almost every work in this sector uses synthetic (e.g. DARPA) or pre-made data sets. Our study is based on layer-seven quantities (the number of e-mails sent in a chosen period): we quantitatively analyzed our network's e-mail traffic (4 SMTP servers, 10 class C networks) and applied our method to the gathered data to detect indirect worm infections (worms which use e-mail to spread). The method is a threshold method and, in our dataset, it identified various worm activities. In this document we show our data analysis and results in order to stimulate new approaches and debates in Anomaly Intrusion Detection Techniques.

Keywords: Anomaly Detection Techniques; indirect worm; real e-mail traffic.

1 Introduction

Network security and Intrusion Detection Systems have become a major research focus with the ever faster development of the Internet and the growth of unauthorized activities on the Net. Intrusion Detection Techniques are an important security barrier against computer intrusions, virus infections, spam and phishing. In the known literature there are two main approaches to worm detection [1], [2]: misuse intrusion detection and anomaly intrusion detection.
The first one is based upon the signature concept: it is more accurate, but it cannot identify intrusions that do not fit a pre-defined signature, and is therefore not adaptive [3], [4]. The second one tries to create a model that characterizes normal behaviour: the system defines the expected network behaviour and raises an alarm if short-term usage deviates significantly from the profile. It is a more adaptive system, ready to counter new threats, but it has a high rate of false positives [5], [6], [7], [8], [9]. In theory, misuse and anomaly detection integrated together can give a holistic estimate of malicious situations on a network.

E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 170–177, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009

Which kinds of threats spread via e-mail? The main ones are worms and viruses, spam and phishing. Let us summarize the situation. At present the Internet has more than one billion users [10], and more and more cyber-criminal activities originate on and misuse this worldwide network with different tools, one of the most important and relevant being electronic mail. In fact, nowadays Internet users are flooded by a huge amount of e-mails infected with worms and viruses: the majority of mass-mailing worms employ a Simple Mail Transfer Protocol engine as their infection delivery mechanism, so in recent years a multitude of worm epidemics has affected millions of networked computing devices [11]. Are worms a real and growing threat? The answer is simple, and we can find it in the virulent events of recent years: thousands of hosts infected and billions of dollars in damage [12].
Moreover, we have recently witnessed a merger between worm and spam activities: it has been estimated that 80% of spam is sent by spam zombies [13], which suggests that the future holds more bad news.

How can we neutralize all these menaces? Many lines of research in Intrusion Detection Techniques have been developed over the years. Proposed Network Intrusion Detection Systems have worked, and still work, on information available at different TCP stack layers: Datalink (e.g. monitoring ARP [14]), Network (e.g. monitoring IP, BGP and ICMP [15], [16]), Transport (e.g. monitoring DNS MX queries [17], [18]) and Application (e.g. monitoring system calls [19], [20]). Sometimes, because of the enormous number of features available at the different levels, researchers correlate the information gathered at each level in order to improve the effectiveness of the detection system. In a similar way we propose to work with e-mail, focusing our attention on quantities, considering that the three phenomena above (worms and viruses, spam, phishing) have something in common: they all use SMTP [21] as a proliferation medium.

In this paper our goal is to present a dataset analysis which reflects the complete SMTP traffic sent by seven /24 networks, in order to detect worm and virus infections, given that no user in our network is a spammer. Moreover, we want to stress that the dataset we worked on is genuine and not synthetic (like KDD 99 [22] and DARPA [23]), so we hope that it might inspire other researchers and that in the near future our work might produce a genuine data set at everyone's disposal.

The paper is structured as follows. Section 2 introduces the scenario of the analysis. Section 3 discusses the dataset we worked on. The theory and methods of our analysis are described in Section 4, and the experimental results obtained with our tools for analyzing mail activity are discussed in Section 5. In Section 6 we give our conclusions.

2 Scenario

Our approach is highly experimental.
In fact we work on ten local area networks (class C) interconnected by a layer-three switch and directly connected to the Internet (no NAT [24] policies, all public IPs). In this network we have four different mail servers, varying from Postfix to Sendmail. As every system administrator knows, every server has its own kind of log, and the main problem with log files is that every transaction is interleaved with the others, which makes them difficult for a human being to read: for this reason we focused our anomaly detection on a single Postfix mail server, optimizing our efforts. Every mail is checked by an antivirus server (Sophos). To counter spam we have SpamAssassin [25] and Greylisting [26]. SpamAssassin is software which uses a variety of mechanisms, including header and text analysis, Bayesian filtering, DNS blocklists and collaborative filtering databases, to detect spam. A mail transfer agent which uses greylisting temporarily rejects e-mail from unknown senders. If the mail is legitimate, the originating server will try to send it again later, as the RFC requires, while mail from a spammer will probably not be retried, because very few spammers are RFC compliant. The few spam sources which do re-transmit later are more likely to be listed in DNSBLs [27] and distributed signature systems. Greylisting and SpamAssassin heavily reduced our spam percentage.

To complete the description we must add that port 25 is monitored and filtered: hosts inside our network cannot communicate with hosts outside our network on port 25, and outsiders cannot communicate with our hosts on port 25. These restrictions nullify two threats: the first concerns infected hosts which can become spam-zombie PCs; the second concerns the misuse of SMTP relaying.
In fact, since we are a research institution, almost all the hosts are used by a single person who holds root privileges and could therefore install an SMTP server. Only a few of the hosts are shared among different people (students, fellow researchers etc.). We have a good balance between Linux distributions and Windows operating systems.

We focus our attention on one mail server, which runs Postfix. Every mail is checked by the antivirus server, which is updated once an hour: this is an important fact, because it ensures that all the worms found during the analysis are zero-day worms [28-30]. This server serves 300 users with a wide international social network, due to the research mission of our Institution [31]: this grants us a huge amount of SMTP traffic.

3 Dataset

We analyze the mail-server log of a 1065-day period (2 years and 11 months). To speed up the process we used LMA (Log Mail Analyzer [32]) to make the log more readable. LMA is an open-source Perl program, available on Sourceforge, which makes Postfix and Sendmail logs human readable. It reconstructs every single e-mail transaction spread across the mail server log and creates a plain text file in a simpler format. Every row represents a single transaction and has the following fields:

• Timestamp: the moment at which the e-mail was sent; this information is available either in Unix timestamp format or as a standard date in Julian format.
• Client: the hostname of the e-mail sender (HELO identifier).
• Client IP: the IP of the sender's host.
• From: the e-mail address of the sender.
• To: the e-mail address of the receiver.
• Status: the server response (e.g. 450, 550 etc.).

With this format it is possible to find the moment at which the e-mail was sent, the sender's client name and IP, the From and To fields of the e-mail and the server response.
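As an illustration only, the six fields listed above could be modelled as follows. This is a minimal sketch, not LMA's actual output format: it assumes one whitespace-separated row per transaction with a Unix timestamp first, and the names `MailTransaction` and `parse_record` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class MailTransaction:
    timestamp: datetime  # moment the e-mail was sent
    client: str          # sender hostname (HELO identifier)
    client_ip: str       # IP of the sender's host
    sender: str          # From address
    recipient: str       # To address
    status: int          # server response code, e.g. 250, 450, 550


def parse_record(line: str) -> MailTransaction:
    """Parse one row, ASSUMING six whitespace-separated fields with a
    Unix timestamp first (a hypothetical layout, not LMA's documented one)."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    ts, client, client_ip, sender, recipient, status = fields
    return MailTransaction(
        timestamp=datetime.fromtimestamp(int(ts), tz=timezone.utc),
        client=client,
        client_ip=client_ip,
        sender=sender,
        recipient=recipient,
        status=int(status),
    )
```

Keeping the status as an integer makes it easy to separate accepted transactions (250) from rejected ones (such as 450 or 550) in later steps.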
Let us take an example: if [email protected] sends an e-mail on 23 March 2006 to [email protected] from X.X.2.235, and all the e-mail server transactions are successful, we will have a record like this:

23/03/2006 X.X.2.235 [email protected] [email protected] 250

As already said, we want to stress that our data is not synthetic and so does not suffer from bias of any form: it reflects the complete set of e-mails received by a single high-traffic e-mail server and it represents the overall view of a typical network operator. Furthermore, contrary to synthetic data sets, it is affected by accidental hardware faults: can anyone say that a network topology is static and not prone to wanted and unwanted changes? Intrusion detection evaluation datasets rest on some hypotheses, among them a never-changing topology and immortal hardware. As a matter of fact this is not realistic: Murphy's Law holds true and strikes with extraordinary efficiency. In addition, our data concern only the SMTP flow and, thanks to the long-term monitoring, they are a good snapshot of everyday life and, romantically, a silent witness of the growth of the Internet and of e-mail use.

4 Analysis

Our analysis has been made on the e-mail traffic of ten class C networks over a period of 1065 days, from January 2004 to November 2006. In our analysis we work on the global e-mail flow in a given time interval. We use threshold detection [33], as other software does (e.g. Snort): if the traffic volume rises above a given threshold, the system triggers an alarm. The threshold is calculated statistically, by determining the network's normal e-mail traffic in selected slices of time: for example, we take the activity of a month and divide the month into five-minute slices, calculating how many e-mails are normally sent in five minutes. After that, we check that the number of e-mails sent during each interval of a day does not exceed the threshold.
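The slicing and threshold check just described can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' tool: it takes sending times as Unix timestamps in seconds, and the function names and the `SLICE_SECONDS` constant are hypothetical.

```python
from collections import Counter

SLICE_SECONDS = 5 * 60  # five-minute slices, as in the text


def count_per_slice(timestamps, slice_seconds=SLICE_SECONDS):
    """Map each slice start (Unix seconds) to the number of e-mails in it."""
    counts = Counter()
    for ts in timestamps:
        # Round the timestamp down to the start of its slice.
        counts[ts - ts % slice_seconds] += 1
    return counts


def slices_over_threshold(counts, threshold):
    """Return the slice starts whose e-mail count rises above the threshold."""
    return sorted(start for start, n in counts.items() if n > threshold)
```

Note that slices in which no e-mail was sent simply do not appear in the counter; whether such empty slices should enter the statistics is a separate modelling choice.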
We call this kind of analysis base-line analysis. Our strategy is to study the temporal correlation between the present behaviour of a given entity (a PC, the entire network), possibly modified by worm activity, and its past behaviour (normal activity, no virus or worm presence). Before proceeding, however, we pre-process the data by subtracting the mean from the values and discarding all the intervals left with a negative number of e-mails, because we wanted to mask the periods of no or little activity, which are not interesting for our purposes. In other words, we discarded all the time slices in which the number of e-mails sent was below the monthly average, with the purpose of dynamically selecting the activity periods (working hours, no holidays etc.). Had we not performed this pre-processing, the average would have depended on the duration of nights, weekends and holidays. In 2004 the mean number of e-mails sent was 524 per day over 339 activity days before pre-processing, and 773 per day over 179 activity days after it.

After this we calculate the baseline activity of working hours as μ + 3σ, where μ is the mean and σ the standard deviation of the number of e-mails per time slice. The mean and the variance are calculated for every month, modelling the network behaviour and taking into account every chosen time interval (e.g. we divide February into five-minute slices, count how many e-mails are sent in each slice and then calculate the mean over these intervals). The values were compared with the baseline threshold and marked whenever they exceeded it.

Analyzing the first five months with five-minute slices we found too many alerts, many of which exceeded the threshold by only a few e-mails. So we decided to correlate the alerts found with a five-minute period with those found with a one-hour period, on the hypothesis that a worm which has infected a host sends a lot of e-mails both in a short period and in a somewhat longer one.
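The pre-processing, the μ + 3σ baseline and the correlation of short and long alerts can be sketched as follows. This is one possible reading of the procedure, not the authors' implementation: it assumes the baseline is computed on the raw counts of the retained slices, uses the population standard deviation (the text does not say which estimator was used), and assumes slice starts are Unix seconds aligned to the slice length. All function names are hypothetical.

```python
from statistics import mean, pstdev


def active_slices(counts):
    """Keep only slices at or above the mean count (drops quiet periods)."""
    mu = mean(counts)
    return [c for c in counts if c - mu >= 0]


def baseline(counts):
    """Threshold mu + 3*sigma over the retained (active) slice counts.

    ASSUMPTION: population standard deviation on raw retained counts.
    """
    kept = active_slices(counts)
    return mean(kept) + 3 * pstdev(kept)


def alerts(counts_by_slice, threshold):
    """Slice starts whose count exceeds the baseline threshold."""
    return sorted(s for s, c in counts_by_slice.items() if c > threshold)


def correlate(short_alerts, long_alerts, long_seconds=3600):
    """Keep short-slice alerts that fall inside an alerting long slice."""
    long_set = set(long_alerts)
    return [s for s in short_alerts if s - s % long_seconds in long_set]
```

For instance, with monthly slice counts `[0, 0, 2, 4, 6]` the mean is 2.4, so only the slices with 4 and 6 e-mails are retained, giving a baseline of 5 + 3 × 1 = 8. A five-minute alert is then kept by `correlate` only if the hour containing it also raised an alert.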
To clarify the concept, let us take the analysis of one month: April 2004 (see Fig. 1 and Fig. 2). The five-minute base-line is 63 e-mails, while the one-hour base-line is 463. In the five-minute analysis we found sixteen alerts, while in the one-hour analysis we found only three. Why is there such a big gap between the two approaches? In the five-minute analysis we have a lot of false alarms, due to e-mails sent to very large mailing lists, while in the one-hour analysis we find very few alarms, but these are more significant because they represent a sustained violation of the normal (expected) activity. Correlating these results, that is, searching for the selected five-minute periods within the three one-hour alerts, we found that a small set of the five-minute alarms were close together on the time line: after a deeper analysis, using our knowledge and experience of real users' activity, we concluded that it was worm activity.

Fig. 1. Example of e-mail traffic: hour base-line
Fig. 2. Example of e-mail traffic: five minutes base-line

4.1 SMTP Sender Analysis

Sometimes the peaks caught by the flow analysis were e-mails sent to mailing lists which are, as already said, a bothersome source of false alarms. This fact prompted the From analysis, in which we examine how many different e-mail addresses every host uses: we look at which From field is used by every host. In fact a host owned by a single person, or by a few persons, is not likely to use a lot of different e-mail addresses in a short time, and if it does so, this is to be considered highly suspicious behaviour. So we think that this analysis can be used to identify true positives, or to point out suspect activity. Of course it is not certain that a worm will change the From field continuously, but it is a likely event.

4.2 SMTP Reject Analysis

One typical feature of malware is haste in spreading the infection.
This haste leads indirect worms to send a lot of e-mails to unknown receivers or nonexistent e-mail addresses: this is a mistake that we think is very important. In fact all e-mails sent to a nonexistent e-mail address are rejected by the mail server and tracked in the log. In this step of our work we analyze the flow of rejected e-mails: we work only on the e-mails referred by the Internet server. With this approach we identified worm activity.

Table 1. Experimental results

Date                     Infected Host                                   Analysis
28/01/04 18:00           X.X.6.24                                        Baseline, from, reject
29/01/04 10:30           X.X.4.24                                        From, reject
28/04/2004 14:58-15:03   X.X.7.20                                        Baseline, from, reject
28/04/2004 15:53-15:58   X.X.5.216                                       Baseline, from, reject
29/04/2004 09:08-10:03   X.X.6.36, X.X.7.20                              Baseline, from, reject
04/05/2004 12:05-12:10   X.X.5.158                                       Baseline, from, reject
04/05/2004 13:15-13:25   X.X.5.158                                       Baseline, from, reject
31/08/04 14:51           X.X.3.234                                       Baseline, reject
31/08/04 17:46           X.X.3.234                                       Baseline, reject
23/11/04 11:38           X.X.3.101, X.X.3.200, X.X.3.234, X.X.5.123      Baseline, reject
22/08/05 17:13           X.X.10.10                                       Baseline, from, reject
22/08/05 17:43           X.X.10.10                                       Baseline, from, reject
22/08/05 20:18           X.X.10.10                                       Baseline, from, reject
22/08/05 22:08-22:13     X.X.10.10                                       Baseline, from, reject

5 Results

The approach detects 14 worm activities, mostly concentrated in 2004. We think that this is due to the new firewall policies introduced in 2005 and to the introduction of a second antivirus engine in our Mail Transfer Agent. Moreover, in recent years we have not had very large worm infections. The results we obtained are summarized in Table 1.

6 Conclusion

Baseline analysis can be useful in identifying some indirect worm activity, but this approach needs to be integrated with other methods, because it lacks a complete view of SMTP activity: this gap can be filled by methods which analyze other SMTP aspects, such as the From and To e-mail fields.
In future this method can be integrated into an anomaly detection system to achieve greater accuracy in detecting anomalies.

Acknowledgments. This work was supported by the National Research Council of Italy and the University of Genoa.

References

1. Axelsson, S.: Intrusion detection systems: A survey and taxonomy. Tech. Rep. 99-15, Chalmers Univ. (March 2000)
2. Verwoerd, T., Hunt, R.: Intrusion detection techniques and approaches. Comput. Commun. 25(15), 1356–1365 (2002)
3. Ilgun, K., Kemmerer, R.A., Porras, P.A.: State transition analysis: A rule-based intrusion detection approach. IEEE Transactions on Software Engineering 21(3), 181–199 (1995)
4. Kumar, S., Spafford, E.H.: A software architecture to support misuse intrusion detection. In: Proceedings of the 18th National Information Security Conference, pp. 194–204 (1995)
5. Denning, D.E.: An intrusion detection model. IEEE Transactions on Software Engineering (1987)
6. Estévez-Tapiador, J.M., Garcia-Teodoro, P., Diaz-Verdejo, J.E.: Anomaly detection methods in wired networks: A survey and taxonomy. Comput. Commun. 27(16), 1569–1584 (2004)
7. Du, Y., Wang, W.-q., Pang, Y.-G.: An intrusion detection method using average hamming distance. In: Proceedings of the Third International Conference on Machine Learning and Cybernetics, Shanghai, 26-29 August (2004)
8. Anderson, D., Frivold, T., Valdes, A.: Next-generation intrusion detection expert system (NIDES). Computer Science Laboratory (SRI International, Menlo Park, CA): Technical report SRI-CSL-95-07 (1995)
9. Wang, Y., Abdel-Wahab, H.: A Multilayer Approach of Anomaly Detection for Email Systems. In: Proceedings of the 11th IEEE Symposium on Computers and Communications (ISCC 2006) (2006)
10. http://www.internetworldstats.com/stats.htm
11. http://en.wikipedia.org/wiki/Notable_computer_viruses_and_worms
12. Moore, D., Paxson, V., Savage, S., Shannon, C., Staniford, S., Weaver, N.: Inside the slammer worm. IEEE Magazine of Security and Privacy, 33–39 (July/August 2003)
13. Leyden, J.: Zombie PCs spew out 80% of spam. The Register (June 2004)
14. Yasami, Y., Farahmand, M., Zargari, V.: An ARP-based Anomaly Detection Algorithm Using Hidden Markov Model in Enterprise Networks. In: Second International Conference on Systems and Networks Communications (ICSNC 2007) (2007)
15. Berk, V., Bakos, G., Morris, R.: Designing a Framework for Active Worm Detection on Global Networks. In: Proceedings of the first IEEE International Workshop on Information Assurance (IWIA 2003), Darmstadt, Germany (March 2003)
16. Bakos, G., Berk, V.: Early detection of internet worm activity by metering ICMP destination unreachable messages. In: Proceedings of the SPIE Aerosense 2002 (2002)
17. Whyte, D., Kranakis, E., van Oorschot, P.C.: DNS-based Detection of Scanning Worms in an Enterprise Network. In: Proceedings of the 12th Annual Network and Distributed System Security Symposium, San Diego, USA, February 3-4 (2005)
18. Whyte, D., van Oorschot, P.C., Kranakis, E.: Addressing Malicious SMTP-based Mass-Mailing Activity Within an Enterprise Network
19. Hofmeyr, S.A., Forrest, S., Somayaji, A.: Intrusion Detection using Sequences of System Calls. Journal of Computer Security 6(3), 151–180 (1998)
20. Cha, B.: Host anomaly detection performance analysis based on system call of Neuro-Fuzzy using Soundex algorithm and N-gram technique. In: Proceedings of the 2005 Systems Communications (ICW 2005) (2005)
21. http://www.ietf.org/rfc/rfc0821.txt
22. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
23. http://www.ll.mit.edu/IST/ideval/data/data_index.html
24. http://en.wikipedia.org/wiki/Network_address_translation
25. http://spamassassin.apache.org/
26. Harris, E.: The Next Step in the Spam Control War: Greylisting
27. http://en.wikipedia.org/wiki/DNSBL
28. Crandall, J.R., Su, Z., Wu, S.F., Chong, F.T.: On Deriving Unknown Vulnerabilities from Zero-Day Polymorphic and Metamorphic Worm Exploits. In: CCS 2005, Alexandria, Virginia, USA, November 7–11 (2005)
29. Portokalidis, G., Bos, H.: SweetBait: Zero-Hour Worm Detection and Containment Using Honeypots
30. Akritidis, P., Anagnostakis, K., Markatos, E.P.: Efficient Content-Based Detection of Zero-Day Worms
31. http://www.cnr.it/sitocnr/home.html
32. http://lma.sourceforge.net/
33. Geer, D.: Behaviour-Based Network Security Goes Mainstream. Computer (March 2006)