On-the-fly Statistical Classification of Internet Traffic
at Application Layer Based on Cluster Analysis
Andrea Baiocchi1, Gianluca Maiolini2, Giacomo Molina1, and Antonello Rizzi1
1 INFOCOM Dept., University of Roma “Sapienza”
Via Eudossiana 18 - 00184 Rome, Italy
2 ELSAG Datamat – Divisione automazione sicurezza e trasporti,
Via Laurentina 760 – 00143 Rome, Italy
Abstract. We address the problem of classifying Internet packet flows according to the application level protocol that generated them. Unlike deep packet inspection, which reads up to application layer payloads and keeps track of packet sequences, we consider classification based on
statistical features extracted in real time from the packet flow, namely IP packet lengths and
inter-arrival times. A statistical classification algorithm is proposed, built upon the powerful
and rich tools of cluster analysis. By exploiting traffic traces taken at the Networking Lab of
our Department and traces from CAIDA, we defined data sets made up of thousands of flows
for up to five different application protocols. With the classic approach of training and test data
sets we show that cluster analysis yields very good results in spite of the little information it is
based on, a limitation imposed by the real-time decision requirement. We aim to show that the investigated
applications are characterized by a “signature” at the network layer that can be useful to
recognize such applications even when the port number is not significant. Numerical results are
presented to highlight the effect of major algorithm parameters. We discuss complexity and
possible exploitation of the statistical classifier.
1 Introduction
As broadband communications widen the range of popular applications, there is an
increasing demand for fast means of classifying traffic according to the service that
generates the data. The specific meaning of service depends on the context and purpose of
traffic classification. In the case of traffic filtering for security or policy enforcement
purposes, the service can usually be identified with the application layer protocol. However,
many different kinds of services exploit HTTP or SSH (e.g. file transfer, multimedia
communications, even P2P), so that a simple header-based filter (e.g. exploiting the IP
address and TCP/UDP port numbers) may be inadequate.
Traffic classification at application level can therefore be based on the analysis of
the entire packet content (header plus payload), usually by means of finite state machine schemes. Although there are widely available software tools for such a classification approach (e.g. L7filter, BRO, Snort), they can hardly keep up with high speed
links and are usually inadequate for backbone use (e.g. Gbps links).
E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 178–185, 2009.
springerlink.com
© Springer-Verlag Berlin Heidelberg 2009
The solution based on port analysis is becoming ineffective because of applications
running on non-standard ports (e.g. peer-to-peer). Furthermore, traffic classification
based on deep packet inspection is resource-consuming and hard to implement on
high capacity links (e.g. Gbps OC links). For these reasons, different approaches to
traffic classification have been developed, using all the information available at network layer. Some proposals ([4], [5]), however, need semantically complete TCP
flows as input: we target a real-time tool, able to classify the application layer protocol of a TCP connection by observing just the first few packets of the connection
(hereinafter referred to as a flow).
A number of works [5], [6], [7] rely on unsupervised learning techniques. The only
features they use are packet size and packet direction: they demonstrate the effectiveness of these algorithms even using a small number of packets (e.g. the first four of a
TCP connection).
We believe that packet inter-arrival times also contain information relevant to the classification problem. We provide a way to bring out the information
contained in inter-arrival times, while preserving the real-time character of the approach
described in [7]: we try to clean the inter-arrival times (as better explained in Section 3)
by assessing the contribution of network congestion, so as to bring out the time that depends on the
application layer protocol.
The paper is organized as follows. In Section 2 the classification problem is defined and notation is introduced. Section 3 is devoted to the description of the traffic
data sets used for the defined algorithm assessment and the numerical evaluation. The
cluster analysis based statistical classifier is defined in Section 4. Numerical examples
are given in Section 5 and final remarks are stated in Section 6.
2 Problem Statement
In this paper, we focus on the classification of IP flows generated by network applications communicating through the TCP protocol, such as HTTP, SMTP, POP3, etc. With this
in mind, we define a flow F as the unidirectional, ordered sequence of IP packets produced either by the client towards the server, or by the server towards the client, during an application layer session. The server-client flow FServer is composed of
(NServer + 1) IP packets, from PK0 to PKNServer, where PKj represents the j-th IP packet
sent by the server to the client; the corresponding client-server flow FClient is
composed of (NClient + 1) IP packets. At the IP layer, each flow F can be characterized
as an ordered sequence of N pairs Pi = (Ti ; Li), with 1 ≤ i ≤ N, where Li represents the
size of PKi (including the TCP/IP header) and Ti represents the inter-arrival time between
PKi-1 and PKi.
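As an illustration of the flow description above, the following minimal Python sketch derives the pairs (Ti ; Li) from per-packet timestamps and IP lengths of one unidirectional flow. The function name flow_pairs and the input layout (index 0 holding the reference packet PK0) are assumptions made for this example and are not taken from the software used in this work.

```python
from typing import List, Sequence, Tuple


def flow_pairs(timestamps: Sequence[float],
               lengths: Sequence[int]) -> List[Tuple[float, int]]:
    """Return the pairs (T_i, L_i), i = 1..N, of one unidirectional flow.

    timestamps[0] and lengths[0] are assumed to belong to the
    reference packet PK_0, which only anchors the first inter-arrival time.
    """
    if len(timestamps) != len(lengths):
        raise ValueError("timestamps and lengths must have the same size")
    pairs = []
    for i in range(1, len(timestamps)):
        t_i = timestamps[i] - timestamps[i - 1]  # inter-arrival PK_{i-1} -> PK_i
        pairs.append((t_i, lengths[i]))          # L_i includes the TCP/IP header
    return pairs


# Toy flow: four packets, timestamps in seconds, lengths in bytes.
example = flow_pairs([0.0, 0.5, 0.75, 1.75], [60, 52, 1500, 40])
print(example)
```

Only the first few pairs of each flow are needed for the classification discussed in this paper, so the list can be truncated as soon as enough packets have been observed.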
In our study we consider only semantically complete TCP flows, namely flows
starting with a SYN or SYN-ACK TCP segment (for the client-to-server and
server-to-client direction, respectively). Because of the limited number of packets considered in
this work, we do not require the FIN TCP segment to be observed.
With this in mind, we aim to derive a description of each protocol (through clustering
techniques): such a description should be based on the first few packets of the flows and
should be able to strongly characterize each analyzed protocol. The purpose of this work
is the definition of an algorithm that takes as input a traffic flow from an unknown application and gives as output the (probable) application responsible for its generation.
3 Dataset Description
In this work, we focus our attention on five different application layer protocols,
namely HTTP, FTP-Control, POP3, SMTP and SSH, which are among the most used
protocols on the Internet.
As for HTTP and FTP-Control (FTP-C in the following), we collected traffic traces
in the Networking Lab at our Department. By means of automated tools running on
machines within the Lab, thousands of carefully selected web pages have been visited
in random order, over thousands of web sites distributed in various geographical
areas (Italy, Europe, North America, Asia). FTP sites have been addressed as well, and
control FTP sessions have been established with thousands of remote servers, again distributed over a
wide area. The generated traffic has been captured on our LAN switch; we verified
that the bottleneck of the TCP connections was never the link connecting our LAN to the
Internet, so that the measured inter-arrival times are not too noisy. This experimental setup, while allowing the capture of artificial traffic that (realistically) emulates
user activity, gives us traces with reliable application layer protocol classification.
Traffic flows for the other protocols (POP3, SMTP, SSH) are extracted from backbone traffic traces made available by CAIDA. Precisely, we randomly extracted
flows from the OC-48 traces of the days 2002-08-14, 2003-01-15 and 2003-04-24.
For privacy reasons, only anonymized packet traces with no payloads are made
available. As for SSH, it can be configured as an encrypted tunnel to transport
any kind of application. Even in its “normal” use (remote management), it
would be difficult to recognize a specific behavioral pattern, due to its human-interactive nature. For these reasons we expect the classification results involving
SSH flows to be worse than those without them.
Starting from these traffic traces, and focusing our attention only on semantically
complete server-client flows, we created two different data sets with 1000 flows for
each application. Each flow in a data set is described by the following fields:
• a protocol label coded as an integer from 1 to 5;
• P ≥ 1 couples (Ti ; Li), where Ti is the inter-arrival time (difference between
timestamps) of the (i – 1)-th and i-th packet of the considered flow and Li is the
IP packet length of the i-th packet of the flow, i = 1,…, P.
Inter-arrival times are in seconds, packet lengths are in bytes.
The 0-th packet of a flow, used as a reference for the first inter-arrival time, is conventionally defined as the one carrying the SYN-ACK TCP segment for the server-to-client direction.
The label in the first field is used as the target class for flows in both the training
and test sets. The other 2P quantities are normalized and define a 2P-dimensional
array associated with the considered flow. Normalization has to be done carefully: we
choose to normalize packet lengths between 40 and 1500 bytes, which are the minimum and maximum observed lengths. As for inter-arrival times, normalization is done over
an entire data set of M flows making up a training or a test set. Let (Ti(j), Li(j))
be the i-th couple of the j-th flow (i = 1,…, P ; j = 1,…,M). Then we let:
\[
\hat{T}_i^{(j)} = \frac{T_i^{(j)} - \min_{1 \le k \le M} T_i^{(k)}}{\max_{1 \le k \le M} T_i^{(k)} - \min_{1 \le k \le M} T_i^{(k)}}, \qquad
\hat{L}_i^{(j)} = \frac{L_i^{(j)} - 40}{1500 - 40}, \qquad i = 1, \dots, P \tag{1}
\]
In the following we assume P = 5.
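The normalization of eq. (1) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name normalize, the interleaved ordering (T1, L1, ..., TP, LP) of the resulting 2P-dimensional array, and the choice of returning 0 when all flows share the same i-th inter-arrival time are assumptions of this example, since the text does not specify them.

```python
from typing import List, Tuple

Flow = List[Tuple[float, float]]  # P couples (T_i, L_i) of one flow

L_MIN, L_MAX = 40.0, 1500.0  # minimum and maximum IP packet length in bytes


def normalize(flows: List[Flow]) -> List[List[float]]:
    """Map each flow of a data set to a 2P-dimensional array as in eq. (1)."""
    p = len(flows[0])
    # Per-position extremes of the inter-arrival times over the M flows.
    t_min = [min(f[i][0] for f in flows) for i in range(p)]
    t_max = [max(f[i][0] for f in flows) for i in range(p)]
    vectors = []
    for flow in flows:
        vec = []
        for i, (t, length) in enumerate(flow):
            span = t_max[i] - t_min[i]
            # Assumption: a degenerate span maps to 0 to avoid dividing by zero.
            t_hat = (t - t_min[i]) / span if span > 0 else 0.0
            l_hat = (length - L_MIN) / (L_MAX - L_MIN)
            vec.extend([t_hat, l_hat])  # assumed ordering: T_1, L_1, ..., T_P, L_P
        vectors.append(vec)
    return vectors


# Toy data set with M = 2 flows and P = 2 couples per flow.
toy = [[(0.0, 40), (0.5, 1500)], [(1.0, 770), (1.5, 40)]]
print(normalize(toy))
```

Note that the inter-arrival extremes depend on the data set at hand, whereas the packet length bounds are fixed constants.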
A different version of this data set has also been used, the so-called pre-processed data
set. In this case, inter-arrival times are replaced by the differential inter-arrival
times, obtained as DTi = Ti − T0, i = 1,…,P, where T0 is the time elapsing between the
packet carrying the TCP SYN-ACK of the flow and the next packet, most of the time a
presentation message (as for FTP-C) or an ACK (as for HTTP). So, T0 approximates
the first RTT of the connection, including only time depending on TCP processing
(as we have seen in our experimental setup). The differential delay can therefore
be expressed as
\[
DT_i = T_i - T_0 \approx RTT_i - RTT_0 + T_{A_i} \tag{2}
\]
where we account for the fact that T0 ≈ RTT0 and that Ti comprises the i-th RTT and,
in general, an application dependent time, TAi.
Hence, we expect application layer protocol features to be more evident in the
pre-processed data set than in the plain one, since the contribution of the application to inter-arrival times is usually much smaller than the average RTT in a
wide area network. With differential inter-arrival times, instead, the
noise affecting the application dependent inter-arrival times is reduced to the RTT
variation (zero on average).
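The pre-processing step of eq. (2) amounts to a single subtraction per inter-arrival time, as the following Python sketch shows. The function name differential_times is hypothetical, T0 is assumed to be measured separately and passed in, and the toy values (a constant RTT of 0.125 s) are invented for illustration only.

```python
from typing import List, Sequence


def differential_times(inter_arrivals: Sequence[float], t0: float) -> List[float]:
    """Return DT_i = T_i - T_0 for i = 1..P (eq. (2))."""
    return [t - t0 for t in inter_arrivals]


# Toy case: every T_i is one RTT plus an application dependent time T_Ai.
rtt = 0.125                      # seconds, assumed constant in this toy case
app_times = [0.0, 0.25, 0.5]     # hypothetical application dependent times
measured = [rtt + a for a in app_times]

# With T_0 approximating the RTT, only the application times survive.
print(differential_times(measured, t0=rtt))
```

In this idealized case the RTT cancels exactly; with real traces the residual is the RTT variation mentioned above.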
4 A Basic Classification System Based on Cluster Analysis
In this section some details about the adopted classification system are given.
Basically, a classification problem can be defined as follows. Let P : X → L be an
unknown oriented process to be modeled, where X is the domain set and the codomain
L is a label set, i.e. a set in which it is not possible (or it is misleading) to define an ordering function and hence any dissimilarity measure between its elements.
If P is a single-valued function, we will call it a classification function. Let Str and Sts
be two sets of input-output pairs, namely the training set and the test set. We will call
an instance of a classification problem a given pair (Str , Sts) with the constraint Str ∩ Sts
= Ø. A classification system is a pair (M , TA), where TA is the training algorithm,
i.e. the set of instructions responsible for generating, exclusively on the basis of Str, a
particular instance M̄ of the classification model family M, such that the classification error of M̄ computed on Sts is minimized. The generalization capability,
i.e. the capability to correctly classify any pattern belonging to the domain of the
oriented process to be modeled, is certainly the most important desired feature
of a classification system. From this point of view, the mean classification error on Sts
can be considered as an estimate of the expected behavior of the classifier over all the
possible inputs. In the following, we describe a classification system trained by an
unsupervised (clustering) procedure.
When dealing with patterns belonging to the vector space Rn we can adopt a distance measure, such as the Euclidean distance; moreover, in this case we can define
the prototype of a cluster as the centroid (the mean vector) of all the patterns in the
cluster, thanks to the algebraic structure defined on Rn. Consequently, the distance
between a given pattern xi and a cluster Ck can be easily defined as the Euclidean
distance d(xi ; µk), where µk is the centroid of the mk patterns belonging to Ck:
\[
\mu_k = \frac{1}{m_k} \sum_{x_i \in C_k} x_i \tag{3}
\]
A direct way to synthesize a classification model on the basis of a training set Str consists in partitioning the patterns in the input space (discarding the class label information) by a clustering algorithm (in our case, K-means).
Subsequently, each cluster is labeled with the most frequent class among its patterns.
Thus, a classification model is a set of labeled clusters (centroids); note that more than
one cluster can be associated with the same label, i.e. a class can be represented by
more than one cluster. Assuming that a floating point number is represented with four bytes
and that class labels are coded with one byte, the amount of memory needed to store a classification model is K · (4 · n + 1) bytes,
where K is the number of clusters and n is the input space dimension.
An unlabeled pattern x is classified by determining the closest centroid µi (and thus
the closest cluster Ci) and by labeling x with the class label associated with Ci. It
is important to underline that, since the initialization step of K-means is not
deterministic, in order to compute a reliable estimate of the performance of the classification model on the test set Sts, the whole algorithm must be run several times,
averaging the classification errors on Sts yielded by the different classification models
obtained in each run.
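The procedure described in this section can be sketched in plain Python as follows. This is a minimal illustration and not the implementation used for the experiments: all function names are hypothetical, the initial centroids are drawn at random among the training patterns (one common choice, assumed here), empty clusters simply keep their previous centroid, and squared Euclidean distances are used because they select the same closest centroid as Euclidean distances.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(x, centroids):
    """Index of the centroid closest to pattern x."""
    return min(range(len(centroids)), key=lambda k: sq_dist(x, centroids[k]))


def kmeans(points, k, rng, iters=100):
    """Plain K-means; initial centroids are k distinct random patterns."""
    centroids = [list(p) for p in rng.sample(list(points), k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Centroid update of eq. (3); an empty cluster keeps its old centroid.
        new = [[sum(c) / len(g) for c in zip(*g)] if g else centroids[j]
               for j, g in enumerate(groups)]
        if new == centroids:
            break
        centroids = new
    return centroids


def train(points, labels, k, seed=0):
    """Cluster without labels, then label each cluster by majority vote."""
    centroids = kmeans(points, k, random.Random(seed))
    votes = [Counter() for _ in range(k)]
    for p, y in zip(points, labels):
        votes[nearest(p, centroids)][y] += 1
    # Clusters that attracted no training pattern are dropped from the model.
    return [(c, v.most_common(1)[0][0]) for c, v in zip(centroids, votes) if v]


def classify(model, x):
    """Label of the closest labeled centroid."""
    return min(model, key=lambda entry: sq_dist(x, entry[0]))[1]


def mean_accuracy(train_x, train_y, test_x, test_y, k, runs=10):
    """Average test accuracy over several randomly initialized runs."""
    total = 0.0
    for seed in range(runs):
        model = train(train_x, train_y, k, seed)
        hits = sum(classify(model, x) == y for x, y in zip(test_x, test_y))
        total += hits / len(test_x)
    return total / runs


def model_size_bytes(k, n):
    """Memory of a model: K centroids of n 4-byte floats plus a 1-byte label."""
    return k * (4 * n + 1)


print(model_size_bytes(20, 10))  # K = 20 clusters, n = 2P = 10 features
```

With P = 5 the input space has n = 10 dimensions, so a model with K = 20 clusters takes 20 · (4 · 10 + 1) = 820 bytes under the storage assumptions stated above.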
5 Numerical Results
In this section we provide numerical results for the classification algorithm. We investigated two groups of applications, the first containing HTTP, FTP-C, POP3 and
SMTP, the second also including SSH. Using the non pre-processed data set (hereinafter referred to as original) we obtain a classification accuracy on Sts comparable with
that achievable with port-based classification. This happens because the effect of the RTT
almost completely masks the information carried by inter-arrival times. Table 1
and Table 2 list the global results and the individual contributions of the protocols to
the average value.
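As a reading aid for the tables, the following Python sketch shows one way the per-protocol accuracy and the global accuracy can be computed from true and predicted labels. The function names and the toy labels are assumptions of this example; when every protocol contributes the same number of test flows, the global accuracy coincides with the mean of the per-protocol values.

```python
from collections import Counter


def per_class_accuracy(true_labels, predicted):
    """Fraction of correctly classified flows for each protocol label."""
    total = Counter(true_labels)
    hits = Counter(t for t, p in zip(true_labels, predicted) if t == p)
    return {c: hits[c] / total[c] for c in total}


def overall_accuracy(true_labels, predicted):
    """Fraction of correctly classified flows over the whole test set."""
    correct = sum(t == p for t, p in zip(true_labels, predicted))
    return correct / len(true_labels)


# Toy test set: four protocols, two flows each (labels are hypothetical).
true = [1, 1, 2, 2, 3, 3, 4, 4]
pred = [1, 1, 2, 1, 3, 3, 4, 2]
print(per_class_accuracy(true, pred))
print(overall_accuracy(true, pred))
```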
Using the pre-processed data set we obtain much better results, in particular for the
case with only four protocols, as shown in Table 3 and Table 4. An important aspect
to consider is the complexity of the classification model, namely the number of clusters used. The performance does not increase significantly beyond 20 clusters (Fig. 2 and
Fig. 3): this means that we can achieve good results with a simple model that requires
little computation. Table 4 shows the negative impact SSH has on the
overall classification accuracy, mainly because it is a human-driven protocol, hard to characterize with a few hundred flows. Moreover, SSH suffers from overfitting,
as the probability of success decreases as the number of clusters increases.
Table 1. Average classification accuracy vs # Clusters, P=5, # flows (training+test)=1000, original data set
Table 2. Average classification accuracy vs # Clusters, P=5, # flows (training+test)=1000, original data set
Table 3. Average classification accuracy vs # Clusters, P=5, # flows (training+test)=1000, pre-processed data set
Table 4. Average classification accuracy vs # Clusters, P=5, # flows (training+test)=1000, preprocessed data set
When the data sets are extended with unknown traffic, the probability of correct classification is significantly reduced. Table 5 shows the effect of unknown traffic:
although the overall classification accuracy decreases, our
classifier is not misclassifying flows of the considered protocols; rather, the number of
false positives increases because unknown traffic is erroneously labeled as known
traffic.
Table 5. Average classification accuracy vs # Clusters, P=5, # flows (training+test)=1000, preprocessed data set with HTTP, FTP-C, POP3, SMTP, SSH, Unknown
6 Concluding Remarks
In this work we present a model that could be useful to address the problem of traffic
classification. To this end, we use only the (poor) information available at the network layer,
namely packet sizes and inter-arrival times. In the near future we plan to test
the performance of this model more thoroughly, mainly by extending the data sets we use to a greater
number of protocols. We are also planning to collect traffic traces from our Department link, so as to be able to accurately classify, through
payload analysis, all the protocols we want to analyze.
Moreover, we will have to enhance the K-means algorithm so that it automatically selects
the optimal number of clusters for the data set in use. The following step will
be the use of more recent and powerful fuzzy-like algorithms to achieve better performance in a real environment.
Acknowledgments. The authors thank Claudio Mammone for developing
the software package used to collect part of the traffic traces analysed in this
work.
References
1. Karagiannis, T., Papagiannaki, K., Faloutsos, M.: BLINC: Multilevel traffic classification in
the dark. In: Proc. of ACM SIGCOMM 2005, Philadelphia, PA, USA (August 2005)
2. Crotti, M., Dusi, M., Gringoli, F., Salgarelli, L.: Traffic Classification through Simple Statistical Fingerprinting. ACM SIGCOMM Computer Communication Review 37(1), 5–16
(2007)
3. Wright, C., Monrose, F., Masson, G.: On Inferring Application Protocol Behaviors in Encrypted Network Traffic. Journal of Machine Learning Research (JMLR): Special issue on
Machine Learning for Computer Security 7, 2745–2769 (2006)
4. Moore, A.W., Zuev, D.: Internet traffic classification using Bayesian analysis techniques.
In: ACM SIGMETRICS 2005, Banff, Alberta, Canada (June 2005)
5. McGregor, A., Hall, M., Lorier, P., Brunskill, J.: Flow clustering using machine learning
techniques. In: PAM 2004, Antibes Juan-les-Pins, France (April 2004)
6. Zander, S., Nguyen, T., Armitage, G.: Automated traffic classification and application identification using machine learning. In: LCN 2005, Sydney, Australia (November 2005)
7. Bernaille, L., Teixeira, R., Salamatian, K.: Early Application Identification. In: Proceedings of CoNEXT (December 2006)