On-the-fly Statistical Classification of Internet Traffic at Application Layer Based on Cluster Analysis

Andrea Baiocchi1, Gianluca Maiolini2, Giacomo Molina1, and Antonello Rizzi1

1 INFOCOM Dept., University of Roma "Sapienza", Via Eudossiana 18, 00184 Rome, Italy
2 ELSAG Datamat, Divisione automazione sicurezza e trasporti, Via Laurentina 760, 00143 Rome, Italy

Abstract. We address the problem of classifying Internet packet flows according to the application level protocol that generated them. Unlike deep packet inspection, which reads up to application layer payloads and keeps track of packet sequences, we consider classification based on statistical features extracted in real time from the packet flow, namely IP packet lengths and inter-arrival times. A statistical classification algorithm is proposed, built upon the powerful and rich tools of cluster analysis. By exploiting traffic traces taken at the Networking Lab of our Department and traces from CAIDA, we defined data sets made up of thousands of flows for up to five different application protocols. With the classic approach of training and test data sets we show that cluster analysis yields very good results in spite of the little information it is based on, a limitation imposed by the real-time decision requirement. We aim to show that the investigated applications are characterized by a "signature" at the network layer that can be used to recognize such applications even when the port number is not significant. Numerical results are presented to highlight the effect of the major algorithm parameters. We discuss complexity and possible exploitation of the statistical classifier.

1 Introduction

As broadband communications widen the range of popular applications, there is an increasing demand for fast means of classifying traffic according to the service that generates the data.
The specific meaning of service depends on the context and purpose of traffic classification. For traffic filtering aimed at security or policy enforcement, the service can usually be identified with the application layer protocol. However, many different kinds of services run over HTTP or SSH (e.g. file transfer, multimedia communications, even P2P), so a simple header-based filter (e.g. one exploiting the IP address and TCP/UDP port numbers) may be inadequate. Traffic classification at application level can therefore be based on the analysis of the entire packet content (header plus payload), usually by means of finite state machine schemes. Although software tools for this approach are widely available (e.g. L7filter, BRO, Snort), they can hardly keep up with high-speed links and are usually inadequate for backbone use (e.g. Gbps links).

E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 178–185, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009

The solution based on port analysis is becoming ineffective because of applications running on non-standard ports (e.g. peer-to-peer). Furthermore, traffic classification based on deep packet inspection is resource-consuming and hard to implement on high capacity links (e.g. Gbps OC links). For these reasons, different approaches to traffic classification have been developed, using all the information available at network layer. Some proposals ([4], [5]), however, need semantically complete TCP flows as input: we target a real-time tool, able to classify the application layer protocol of a TCP connection by observing just the first few packets of the connection (hereinafter referred to as a flow). A number of works [5], [6], [7] rely on unsupervised learning techniques.
The only features they use are packet size and packet direction: they demonstrate the effectiveness of these algorithms even with a small number of packets (e.g. the first four of a TCP connection). We believe that packet inter-arrival times also carry information relevant to the classification problem. We provide a way to bring out the information contained in inter-arrival times while preserving the real-time character of the approach described in [7]: we clean the inter-arrival times (as explained in Section 3) by assessing the contribution of network congestion, so as to emphasize the time component that depends on the application layer protocol.

The paper is organized as follows. In Section 2 the classification problem is defined and notation is introduced. Section 3 describes the traffic data sets used for the assessment of the defined algorithm and for the numerical evaluation. The statistical classifier based on cluster analysis is defined in Section 4. Numerical examples are given in Section 5 and final remarks are stated in Section 6.

2 Problem Statement

In this paper, we focus on the classification of IP flows generated by network applications that communicate over TCP, such as HTTP, SMTP, POP3, etc. With this in mind, we define a flow $F$ as the unidirectional, ordered sequence of IP packets produced either by the client towards the server, or by the server towards the client, during an application layer session. The server-client flow $F_{Server}$ is composed of $(N_{Server} + 1)$ IP packets, from $PK_0$ to $PK_{N_{Server}}$, where $PK_j$ is the j-th IP packet sent by the server to the client; the corresponding client-server flow $F_{Client}$ is composed of $(N_{Client} + 1)$ IP packets. At the IP layer, each flow $F$ can be characterized as an ordered sequence of $N$ pairs $P_i = (T_i; L_i)$, with $1 \le i \le N$, where $L_i$ is the size of $PK_i$ (including the TCP/IP header) and $T_i$ is the inter-arrival time between $PK_{i-1}$ and $PK_i$.
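As a minimal sketch of this representation, the Python fragment below turns the packets of one direction of a flow into the ordered pairs (T_i, L_i). The input format (a list of (timestamp, length) tuples whose first element is packet PK_0) and the names `flow_pairs` and `max_pairs` are assumptions made for illustration; they do not come from the original tool.

```python
from typing import List, Tuple

# Hypothetical input format: (timestamp in seconds, IP packet length in bytes).
Packet = Tuple[float, int]


def flow_pairs(packets: List[Packet], max_pairs: int) -> List[Tuple[float, int]]:
    """Return the pairs (T_i, L_i) for i = 1..max_pairs of a unidirectional flow.

    packets[0] is PK_0, used only as the time reference for T_1.
    T_i is the inter-arrival time between PK_{i-1} and PK_i; L_i is the size of PK_i.
    """
    pairs = []
    for i in range(1, min(len(packets), max_pairs + 1)):
        t_prev, _ = packets[i - 1]
        t_cur, length = packets[i]
        pairs.append((t_cur - t_prev, length))
    return pairs


# Example: a three-packet server-to-client flow.
example = [(0.0, 44), (0.5, 40), (0.75, 1500)]
print(flow_pairs(example, max_pairs=5))
```

Truncating at `max_pairs` reflects the goal of deciding from the first few packets of a connection only.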
In our study we consider only semantically complete TCP flows, namely flows starting with a SYN or SYN-ACK TCP segment (for the client-to-server and server-to-client direction, respectively). Because of the limited number of packets considered in this work, we do not require the FIN TCP segment to be observed. With this in mind, we aim to derive a description of each protocol (through clustering techniques): such a description should be based on the first few packets of the flows and should strongly characterize each analyzed protocol. The purpose of this work is the definition of an algorithm that takes as input a traffic flow from an unknown application and gives as output the (probable) application responsible for its generation.

3 Dataset Description

In this work, we focus on five application layer protocols, namely HTTP, FTP-Control, POP3, SMTP and SSH, which are among the most used protocols on the Internet. As for HTTP and FTP-Control (FTP-C in the following), we collected traffic traces in the Networking Lab at our Department. By means of automated tools running on machines within the Lab, thousands of carefully selected web pages were visited in random order, over thousands of web sites distributed across various geographical areas (Italy, Europe, North America, Asia). FTP sites were addressed as well, and control FTP sessions were established with thousands of remote servers, again distributed over a wide area. The generated traffic was captured on our LAN switch; we verified that the bottleneck of the TCP connections was never the link connecting our LAN to the Internet, so that the measured inter-arrival times would not be too noisy. This experimental setup, while capturing artificial traffic that (realistically) emulates user activity, gives us traces with reliable application layer protocol classification.
Traffic flows for the other protocols (POP3, SMTP, SSH) are extracted from backbone traffic traces made available by CAIDA. Precisely, we randomly extracted flows from the OC-48 traces of the days 2002-08-14, 2003-01-15 and 2003-04-24. For privacy reasons, only anonymized packet traces with no payloads are made available. As regards SSH, it can be configured as an encrypted tunnel to transport any kind of application. Even in its "normal" use (remote management), it would be difficult to recognize a specific behavioral pattern because of its human-interactive nature. For these reasons we expect the classification results involving SSH flows to be worse than those without them.

Starting from these traffic traces, and focusing only on semantically complete server-client flows, we created two different data sets with 1000 flows for each application. Each flow in a data set is described by the following fields:

• a protocol label coded as an integer from 1 to 5;
• $P \ge 1$ couples $(T_i; L_i)$, where $T_i$ is the inter-arrival time (difference between timestamps) of the (i – 1)-th and i-th packet of the considered flow and $L_i$ is the IP packet length of the i-th packet of the flow, i = 1,…,P.

Inter-arrival times are in seconds, packet lengths are in bytes. The 0-th packet of a flow, used as a reference for the first inter-arrival time, is conventionally defined as the one carrying the SYN-ACK TCP segment for the server-to-client direction. The label in the first field is used as the target class for flows in both the training and test sets. The other 2P quantities are normalized and define a 2P-dimensional array associated with the considered flow. Normalization has to be done carefully: we normalize packet lengths between 40 and 1500 bytes, which are the minimum and maximum observed lengths. As for inter-arrival times, normalization is done over an entire data set of M flows making up a training or a test set.
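A minimal Python sketch of this normalization step is given below, assuming min-max scaling of the inter-arrival times per packet position over the whole data set (as formalized in Eq. (1)) and fixed bounds of 40 and 1500 bytes for lengths. The function name `normalize_flows`, the interleaved layout of the output vector and the handling of a zero time range are illustrative assumptions, not details stated in the paper.

```python
from typing import List, Tuple

Flow = List[Tuple[float, int]]  # P pairs (T_i in seconds, L_i in bytes)


def normalize_flows(flows: List[Flow], min_len: int = 40, max_len: int = 1500) -> List[List[float]]:
    """Map each of the M flows to a 2P-dimensional vector with entries in [0, 1]."""
    p = len(flows[0])
    # Per-position minimum and maximum inter-arrival time over the M flows.
    t_min = [min(f[i][0] for f in flows) for i in range(p)]
    t_max = [max(f[i][0] for f in flows) for i in range(p)]
    vectors = []
    for flow in flows:
        vec = []
        for i, (t, length) in enumerate(flow):
            span = t_max[i] - t_min[i]
            # Assumption: if all flows share the same T_i, map it to 0.0.
            t_hat = (t - t_min[i]) / span if span > 0 else 0.0
            l_hat = (length - min_len) / (max_len - min_len)
            # Assumption: interleaved layout (T_1, L_1, T_2, L_2, ...).
            vec.extend([t_hat, l_hat])
        vectors.append(vec)
    return vectors


flows = [[(0.0, 40), (1.0, 1500)], [(2.0, 770), (3.0, 40)]]
print(normalize_flows(flows))
```

Because the time bounds are computed over the data set at hand, a training set and a test set normalized separately will in general use different scaling constants.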
Let $(T_i^{(j)}, L_i^{(j)})$ be the i-th couple of the j-th flow (i = 1,…,P; j = 1,…,M). Then we let:

\[
\begin{aligned}
\hat{T}_i^{(j)} &= \frac{T_i^{(j)} - \min_{1 \le k \le M} T_i^{(k)}}{\max_{1 \le k \le M} T_i^{(k)} - \min_{1 \le k \le M} T_i^{(k)}}, & i &= 1, \dots, P \\
\hat{L}_i^{(j)} &= \frac{L_i^{(j)} - 40}{1500 - 40}, & i &= 1, \dots, P
\end{aligned}
\tag{1}
\]

In the following we assume P = 5.

A different version of this data set has also been used, the so-called pre-processed data set. In this case, inter-arrival times are replaced by the differential inter-arrival times, obtained as $\Delta T_i = T_i - T_0$, i = 1,…,P, where $T_0$ is the time elapsing between the packet carrying the TCP SYN-ACK of the flow and the next packet, most of the time a presentation message (as for FTP-C) or an ACK (as for HTTP). So, $T_0$ approximates the first RTT of the connection, including only time that depends on TCP processing (as we observed in our experimental setup). The differential delay can therefore be expressed as

\[
\Delta T_i = T_i - T_0 \approx RTT_i - RTT_0 + T_{A_i} \tag{2}
\]

where we account for the fact that $T_0 \approx RTT_0$ and that $T_i$ comprises the i-th RTT and, in general, an application-dependent time $T_{A_i}$. Hence, we expect application layer protocol features to be more evident in the pre-processed data set than in the plain one, since the contribution of the applications to inter-arrival times is usually much smaller than the average RTT in a wide area network. On the other hand, with differential inter-arrival times, the noise affecting the application-dependent inter-arrival times is reduced to the RTT variation (zero on average).

4 A Basic Classification System Based on Cluster Analysis

In this section some details of the adopted classification system are presented. Basically, a classification problem can be defined as follows. Let $P : X \to L$ be an unknown oriented process to be modeled, where $X$ is the domain set and the codomain $L$ is a label set, i.e. a set in which it is not possible (or it is misleading) to define an ordering function and hence any dissimilarity measure between its elements. If $P$ is a single-valued function, we call it a classification function. Let $S_{tr}$ and $S_{ts}$ be two sets of input-output pairs, namely the training set and the test set. We call an instance of a classification problem a given pair $(S_{tr}, S_{ts})$ with the constraint $S_{tr} \cap S_{ts} = \emptyset$. A classification system is a pair $(M, TA)$, where $TA$ is the training algorithm, i.e. the set of instructions responsible for generating, exclusively on the basis of $S_{tr}$, a particular instance $\bar{M}$ of the classification model family $M$, such that the classification error of $\bar{M}$ computed on $S_{ts}$ is minimized. The generalization capability, i.e. the capability to correctly classify any pattern belonging to the input space of the oriented process to be modeled, is the most important desired feature of a classification system. From this point of view, the mean classification error on $S_{ts}$ can be considered an estimate of the expected behavior of the classifier over all possible inputs.

In the following, we describe a classification system trained by an unsupervised (clustering) procedure. When dealing with patterns belonging to the vector space $\mathbb{R}^n$ we can adopt a distance measure, such as the Euclidean distance; moreover, in this case we can define the prototype of a cluster as the centroid (the mean vector) of all the patterns in the cluster, thanks to the algebraic structure defined on $\mathbb{R}^n$.
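The two building blocks named here, the Euclidean distance and the centroid of a set of patterns, can be sketched in a few lines of Python. The function names `euclidean` and `centroid` are chosen for illustration, and patterns are assumed to be plain lists of floats of equal length.

```python
import math
from typing import List, Sequence


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean distance between two patterns of the same dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def centroid(patterns: List[Sequence[float]]) -> List[float]:
    """Mean vector of a non-empty list of patterns (the cluster prototype)."""
    m = len(patterns)
    # zip(*patterns) iterates over the coordinates of all patterns at once.
    return [sum(column) / m for column in zip(*patterns)]


cluster = [[0.0, 0.0], [2.0, 4.0]]
mu = centroid(cluster)
print(mu, euclidean([0.0, 0.0], [3.0, 4.0]))
```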
Consequently, the distance between a given pattern $x_i$ and a cluster $C_k$ can be easily defined as the Euclidean distance $d(x_i; \mu_k)$, where $\mu_k$ is the centroid of the $m_k$ patterns belonging to $C_k$:

\[
\mu_k = \frac{1}{m_k} \sum_{x_i \in C_k} x_i \tag{3}
\]

A direct way to synthesize a classification model from a training set $S_{tr}$ consists in partitioning the patterns in the input space (discarding the class label information) by a clustering algorithm (in our case, K-means). Subsequently, each cluster is labeled with the most frequent class among its patterns. Thus, a classification model is a set of labeled clusters (centroids); note that more than one cluster can be associated with the same label, i.e. a class can be represented by more than one cluster. Assuming that a floating point number is represented with four bytes and that class labels are coded with one byte, the amount of memory needed to store a classification model is K · (4 · n + 1) bytes, where K is the number of clusters and n is the input space dimension. An unlabeled pattern $x$ is classified by determining the closest centroid $\mu_i$ (and thus the closest cluster $C_i$) and labeling $x$ with the class label associated with $C_i$. It is important to underline that, since the initialization step of K-means is not deterministic, a precise estimate of the performance of the classification model on the test set $S_{ts}$ requires running the whole algorithm several times and averaging the classification errors on $S_{ts}$ yielded by the different classification models obtained in each run.

5 Numerical Results

In this section we provide numerical results for the classification algorithm. We investigated two groups of applications, the first containing HTTP, FTP-C, POP3 and SMTP, the second also including SSH. Using the non-pre-processed data set (hereinafter referred to as original) we obtain a classification accuracy on $S_{ts}$ comparable with that achievable with port-based classification.
This happens because the effect of the RTT almost completely masks the information carried by inter-arrival times. Tables 1 and 2 list the global results and the individual contributions of the protocols to the average value. Using the pre-processed data set we obtain much better results, in particular for the case with only 4 protocols, as can be seen in Table 3 and Table 4. An important aspect to consider is the complexity of the classification model, namely the number of clusters used. The performance does not significantly increase beyond 20 clusters (Fig. 2 and Fig. 3): this means we can achieve good results with a simple model that does not require much computation.

Table 1. Average classification accuracy vs # Clusters, P=5, # flows (training+test)=1000, original data set

Table 2. Average classification accuracy vs # Clusters, P=5, # flows (training+test)=1000, original data set

Table 3. Average classification accuracy vs # Clusters, P=5, # flows (training+test)=1000, preprocessed data set

We can see from Table 4 the negative impact SSH has on the overall classification accuracy, mainly because it is a human-driven protocol, hard to characterize with a few hundred flows. Moreover, SSH suffers from overfitting, as the probability of success decreases as the number of clusters increases.

Table 4. Average classification accuracy vs # Clusters, P=5, # flows (training+test)=1000, preprocessed data set

Extending the data sets with unknown traffic significantly reduces the classification probability. Although the overall classification accuracy decreases, Table 5 shows the effect of unknown traffic. This means that our classifier is not mistaking flows of the considered protocols, but is just raising the number of false positive classifications due to unknown traffic erroneously labeled as known traffic.

Table 5. Average classification accuracy vs # Clusters, P=5, # flows (training+test)=1000, preprocessed data set with HTTP, FTP-C, POP3, SMTP, SSH, Unknown

6 Concluding Remarks

In this work we present a model that could be useful to address the problem of traffic classification. To this end, we use only the (poor) information available at network layer, namely packet sizes and inter-arrival times. In the near future we plan to test the performance of this model more thoroughly, mainly by extending the data sets to a greater number of protocols. We are also planning to collect traffic traces from our Department link, so that all the protocols we want to analyze can be accurately classified through payload analysis. Moreover, we will have to enhance the C-means algorithm so that it automatically selects the optimal number of clusters for the data set in use. The following step will be the use of more recent and powerful fuzzy-like algorithms to achieve better performance in a real environment.

Acknowledgments. The authors thank Claudio Mammone for his development work on the software package used in the collection of part of the traffic traces analysed in this work.

References

1. Karagiannis, T., Papagiannaki, K., Faloutsos, M.: BLINC: Multilevel traffic classification in the dark. In: Proc. of ACM SIGCOMM 2005, Philadelphia, PA, USA (August 2005)
2. Crotti, M., Dusi, M., Gringoli, F., Salgarelli, L.: Traffic Classification through Simple Statistical Fingerprinting. ACM SIGCOMM Computer Communication Review 37(1), 5–16 (2007)
3. Wright, C., Monrose, F., Masson, G.: On Inferring Application Protocol Behaviors in Encrypted Network Traffic. Journal of Machine Learning Research (JMLR): Special issue on Machine Learning for Computer Security 7, 2745–2769 (2006)
4. Moore, A.W., Zuev, D.: Internet traffic classification using Bayesian analysis techniques. In: ACM SIGMETRICS 2005, Banff, Alberta, Canada (June 2005)
5. McGregor, A., Hall, M., Lorier, P., Brunskill, J.: Flow clustering using machine learning techniques. In: PAM 2004, Antibes Juan-les-Pins, France (April 2004)
6. Zander, S., Nguyen, T., Armitage, G.: Automated traffic classification and application identification using machine learning. In: LCN 2005, Sydney, Australia (November 2005)
7. Bernaille, L., Teixeira, R., Salamatian, K.: Early Application Identification. In: Proceedings of CoNEXT (December 2006)