
An Intrusion Detection System Based on Hierarchical
Self-Organization
E.J. Palomo, E. Domínguez, R.M. Luque, and J. Muñoz
Department of Computer Science
E.T.S.I. Informatica, University of Malaga
Campus Teatinos s/n, 29071 – Malaga, Spain
{ejpalomo,enriqued,rmluque,munozp}@lcc.uma.es
Abstract. An intrusion detection system (IDS) monitors the IP packets flowing over the network to capture intrusions or anomalies. One of the techniques used for anomaly detection is building statistical models using metrics derived from observation of the user's actions. A neural network model based on self-organization is proposed for detecting intrusions. The self-organizing map (SOM) has been shown to be successful for the analysis of high-dimensional input data, as in data mining applications such as network security. The proposed growing hierarchical SOM (GHSOM) addresses the limitations of the SOM related to the static architecture of this model. The GHSOM is an artificial neural network model with a hierarchical architecture composed of independent growing SOMs. Randomly selected subsets that contain both attacks and normal records from the KDD Cup 1999 benchmark are used for training the proposed GHSOM.
Keywords: Network security, self-organization, intrusion detection.
1 Introduction
Nowadays, network communications are becoming more and more important to the information society. Business, e-commerce and other network transactions require more secure networks. As these operations increase, computer crimes and attacks become more frequent and dangerous, compromising the security and the trust of a computer system and causing costly financial losses. In order to detect and prevent these attacks, intrusion detection systems have been used and have become an important area of research over the years.
An intrusion detection system (IDS) monitors the network traffic to detect intrusions or anomalies. There are two different approaches used to detect intrusions [1]. The first approach is known as misuse detection, which compares network activity against previously stored signatures of known attacks. This method is good at detecting many or all known attacks, achieving a high detection rate. However, it is not successful in detecting occurrences of unknown attacks, and the signature database has to be modified manually. The second approach is known as anomaly detection. These methods first establish a normal activity profile; variations from this normal activity are then considered anomalous. Anomaly-based systems assume that anomalous activities are intrusion attempts. However, many of these anomalous activities are in fact normal activities, which results in false positives. Many anomaly detection systems build statistical models using metrics derived from the user's actions [1]. Anomaly detection systems using data mining techniques such as clustering, support vector machines (SVM) and neural networks have also been proposed [2-4]. Several IDSs using self-organizing maps have been built; however, they have difficulty detecting a wide variety of attacks with low false positive rates [5].
E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 139–146, 2009.
springerlink.com
© Springer-Verlag Berlin Heidelberg 2009
The self-organizing map (SOM) [6] has been applied successfully in multiple areas. However, the network architecture of a SOM has to be established in advance, which requires knowledge about the problem domain. Moreover, the hierarchical relations among input data are difficult to represent. The growing hierarchical SOM (GHSOM) addresses these problems. The GHSOM is an artificial neural network which consists of several growing SOMs [7] arranged in layers. The number of layers, maps and neurons per map are automatically adapted and established during the training process of the GHSOM, according to a set of input patterns fed to the neural network. Applied to an intrusion detection system, the input patterns are samples of network traffic, which can be classified as anomalous or normal activities. The data analyzed from network traffic can be numerical (e.g. number of seconds of the connection) or symbolic (e.g. protocol type used). Usually, the Euclidean distance has been used as the metric to compare two input data. However, this is not suitable for symbolic data. Therefore, in this paper we propose an alternative metric for GHSOMs that considers both symbolic and numerical data.
The IDS implemented using the GHSOM was trained with the KDD Cup 1999 benchmark data set [8]. This data set has served as the first and only reliable benchmark data set used for most of the research work on intrusion detection algorithms [9]. It includes a wide variety of simulated attacks.
The remainder of this paper is organized as follows. Section 2 discusses the new
GHSOM model used to build the IDS. Then, the training algorithm is described. In
Section 3, we show some experimental results obtained from comparing our implementation of the Intrusion Detection System (IDS) with other related works, where
data from the KDD Cup 1999 benchmark are used. Section 4 concludes this paper.
2 GHSOM for Intrusion Detection
The implemented IDS is based on the GHSOM. Initially, the GHSOM consists of a single SOM of 2x2 neurons. Then, the GHSOM adapts its architecture depending on the input patterns, so that the GHSOM structure mirrors the structure of the input data, achieving a good data representation. This neural network structure is used to classify the input data into groups, where each neuron represents a data group with similar features.
The level of data representation of each neuron is measured as the quantization error of the neuron ($qe$). The $qe$ is a measure of the similarity of a data set: the higher the $qe$, the higher the heterogeneity of the data. Usually, the $qe$ has been expressed in terms of the Euclidean distance between the input pattern and the weight vector of a neuron. However, in many real-life problems not only numerical features are present; symbolic features can be found as well. For example, among the features analyzed to build an IDS, three are symbolic: protocol type (e.g. TCP, UDP, ICMP), service (e.g. HTTP, SMTP, FTP, TELNET, etc.) and flag of the status of the connection (e.g. SF, S1, REJ, etc.). Unlike numerical data, symbolic data have no associated order and cannot be measured by a distance. It makes no sense to use the Euclidean distance between two symbolic values, for example the distance between the HTTP and FTP services. It seems better to use a similarity measure rather than a distance measure for symbolic data. For that reason, in this paper we introduce entropy as the measure of the representation error of a neuron for symbolic data, together with the Euclidean distance for numerical data.
Fig. 1. Sample architecture of a GHSOM
Let $x_j = (x_j^n, x_j^s)$ be the $j$th input pattern, where $x_j^n$ is the vector component of numerical features and $x_j^s$ is the component of symbolic features. The error of a unit $i$ ($e_i$) in the representation is defined as follows:
$$e_i = (e_i^n, e_i^s) = \left( \sum_{x_j \in C_i} \lVert w_i^n - x_j^n \rVert,\; -\sum_{x_j \in C_i} p(x_j^s) \log p(x_j^s) \right), \qquad (1)$$
where $e_i^n$ and $e_i^s$ are the error components of numerical and symbolic features, respectively, $C_i$ is the set of patterns mapped onto the unit $i$, $w_i^n$ is the numerical component of the weight vector of the unit $i$, and $p(x_j^s)$ is the probability of the element $x_j^s$ in $C_i$. The quantization error of the unit $i$ is given by expression (2).
$$qe_i = \lVert e_i \rVert. \qquad (2)$$
First of all, the quantization error at layer 0 ($qe_0$) has to be computed as shown above. In this case, the error $e_0$ is computed as specified in (1), but using $w_0$, the mean of all the input data, instead of $w_i$, and the whole set of input patterns $I$ instead of $C_i$.
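The two-component unit error in (1) and the quantization error in (2) can be illustrated with a minimal Python sketch. The function names and the data layout (each pattern as a pair of a numeric vector and a symbolic tuple) are hypothetical. The sketch also assumes that the symbolic component is the base-2 Shannon entropy of the symbolic tuples mapped onto the unit, summed over distinct values, and that the norm in (2) is Euclidean; the text does not fix either choice.

```python
import math
from collections import Counter


def unit_error(weight_num, patterns):
    """Return (e_num, e_sym) for one unit.

    weight_num: numerical part of the unit's weight vector.
    patterns: list of (numeric_vector, symbolic_tuple) pairs mapped onto the unit.
    """
    # Numerical component: sum of Euclidean distances to the weight vector.
    e_num = sum(math.dist(weight_num, x_num) for x_num, _ in patterns)
    # Symbolic component: entropy of the symbolic tuples (assumption: base 2,
    # summed over the distinct tuples observed in the unit).
    counts = Counter(x_sym for _, x_sym in patterns)
    total = len(patterns)
    e_sym = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return e_num, e_sym


def quantization_error(weight_num, patterns):
    """Euclidean norm of the two-component error vector (assumed norm)."""
    return math.hypot(*unit_error(weight_num, patterns))


mapped = [([0.0, 0.0], ("tcp", "http")), ([3.0, 4.0], ("udp", "dns"))]
e_num, e_sym = unit_error([0.0, 0.0], mapped)
qe = quantization_error([0.0, 0.0], mapped)
```

For the layer 0 error, the same functions would be called with the mean of all input data as the weight and the full input set as the mapped patterns.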
The training of a map is done as the training of a single SOM [6]. An input pattern is randomly selected and each neuron determines its activation according to a similarity measure. In our case, since we take into account the presence of symbolic and numerical data, the neuron with the smallest similarity measure, defined in (3), becomes the winner.
$$d(x_j, w_i) = (d^n, d^s) = \left( \lVert x_j^n - w_i^n \rVert,\; -\log p(x_j^s, w_i^s) \right), \qquad p(x_j^s, w_i^s) \in \{1, 1/2\}. \qquad (3)$$
For the numerical component, $d^n$ is the Euclidean distance between two vectors. For symbolic data, the measure checks whether the two vectors are the same or not; that is, the probability $p$ can take only the value 1, if the vectors are the same, or 0.5, if they are different. Therefore, for symbolic data the value of $d^s$ will be 0 if $x_j^s$ and $w_i^s$ are equal, and $-\log 0.5 = \log 2$ if they are not the same. Taking the new similarity measure into account, the index $c$ of the winner neuron is defined in (4).
$$c = \arg\min_i \lVert d(x_j, w_i) \rVert. \qquad (4)$$
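The mixed similarity measure and the winner selection described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the logarithm is taken in base 2 (so a symbolic mismatch contributes exactly 1), and the two components are combined with a Euclidean norm. Numerical features are used unscaled here, which in practice would let large-valued features dominate.

```python
import math


def similarity(x, w):
    """Two-component measure between a pattern x and a weight vector w.

    Both x and w are (numeric_vector, symbolic_tuple) pairs.
    """
    d_num = math.dist(x[0], w[0])
    # p is 1 when the symbolic parts coincide and 0.5 otherwise,
    # so -log2(p) is 0 for a match and 1 for a mismatch.
    p = 1.0 if x[1] == w[1] else 0.5
    d_sym = -math.log2(p)
    return d_num, d_sym


def winner(x, weights):
    """Index of the unit whose measure has the smallest norm (assumed norm)."""
    return min(range(len(weights)),
               key=lambda i: math.hypot(*similarity(x, weights[i])))


weights = [([0.0], ("tcp",)), ([0.5], ("udp",))]
pattern = ([0.1], ("udp",))
best = winner(pattern, weights)
```

In this toy case the first unit is numerically closer, but the symbolic mismatch adds a full unit to its measure, so the second unit wins.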
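The weight vector of a neuron $i$ is adapted according to expression (5). For the numerical component, the winner and its neighbors are adapted, with an amount of adaptation that follows a Gaussian neighborhood function $h_i$. For symbolic data, just the winner $c$ is adapted, taking as its weight vector the mode of the set of input patterns mapped onto the winner, $C_c$.
$$w_i^n(t+1) = w_i^n(t) + \alpha(t)\, h_i(t) \left[ x_j^n(t) - w_i^n(t) \right], \qquad w_c^s(t+1) = \mathrm{mode}(C_c), \qquad (5)$$
where $\alpha(t)$ is the learning rate.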
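A minimal sketch of the adaptation step follows, assuming the standard SOM update for the numerical part (a learning rate `alpha` multiplied by a Gaussian neighborhood of width `sigma`) and the mode of the whole symbolic tuple for the symbolic part. The names `alpha`, `sigma` and `grid_dist` are hypothetical, and taking the mode per tuple rather than per feature is an assumption.

```python
import math
from collections import Counter


def gaussian(grid_dist, sigma):
    """Gaussian neighborhood value for a unit at grid_dist from the winner."""
    return math.exp(-(grid_dist ** 2) / (2.0 * sigma ** 2))


def update_numeric(w_num, x_num, grid_dist, alpha, sigma):
    """Move the numerical weights toward the pattern, scaled by the neighborhood."""
    h = gaussian(grid_dist, sigma)
    return [w + alpha * h * (x - w) for w, x in zip(w_num, x_num)]


def update_symbolic(mapped_symbolic):
    """Symbolic weight of the winner: the most frequent tuple mapped onto it."""
    return Counter(mapped_symbolic).most_common(1)[0][0]


# The winner itself (grid distance 0) moves the most; neighbors move less.
w_winner = update_numeric([0.0, 0.0], [1.0, 2.0], 0, 0.5, 1.0)
w_neighbor = update_numeric([0.0, 0.0], [1.0, 2.0], 1, 0.5, 1.0)
w_sym = update_symbolic([("tcp", "http"), ("udp", "dns"), ("tcp", "http")])
```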
The growth and expansion of the GHSOM are controlled by means of two parameters: $\tau_1$, which is used to control the growth of a single map; and $\tau_2$, which is used to control the expansion of each neuron of the GHSOM. Specifically, a map stops growing if the mean of the map's quantization errors ($MQE_m$) reaches a certain fraction $\tau_1$ of the $qe_u$ of the corresponding neuron $u$ that was expanded into the map $m$. Also, a neuron $i$ is not expanded if its quantization error ($qe_i$) is smaller than a certain fraction $\tau_2$ of $qe_0$. Thus, the smaller the parameter $\tau_2$ is chosen, the deeper the hierarchy will be. Likewise, for small $\tau_1$ values we will have large maps. In this way, these two parameters provide control over the resulting hierarchical architecture. Note that these parameters are the only ones that have to be established in advance.
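The two criteria described above can be written as a pair of small predicates. This is only a sketch: the names `tau1` and `tau2` stand for the map-growth and neuron-expansion parameters, and the strict inequalities are an assumption about how "reaches a certain fraction" is evaluated.

```python
def map_stops_growing(unit_qes, parent_qe, tau1):
    """True when the mean quantization error of the map falls below
    the fraction tau1 of the parent unit's quantization error."""
    mqe = sum(unit_qes) / len(unit_qes)
    return mqe < tau1 * parent_qe


def units_to_expand(unit_qes, qe0, tau2):
    """Indices of units whose quantization error is not yet below
    the fraction tau2 of the layer 0 quantization error."""
    return [i for i, qe in enumerate(unit_qes) if qe >= tau2 * qe0]


stop = map_stops_growing([1.0, 2.0, 3.0], parent_qe=8.0, tau1=0.5)
pending = units_to_expand([0.5, 2.0, 3.0], qe0=8.0, tau2=0.25)
```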
The pseudocode of the training algorithm of the proposed GHSOM is defined as follows.
Step 1. Compute the mean of all the input data, $w_0$, and then the initial quantization error $qe_0$.
Step 2. Create an initial map with 2x2 neurons.
Step 3. Train the map during $\lambda$ iterations as a single SOM, using the expressions (3), (4) and (5).
Step 4. Compute the quantization error $qe_i$ of each neuron according to the expression (2).
Step 5. Calculate the mean of all units' quantization errors of the map ($MQE_m$).
Step 6. If $MQE_m < \tau_1 \cdot qe_u$, go to step 9, where $qe_u$ is the $qe$ of the corresponding unit $u$ in the upper layer that was expanded. For the first layer, this $qe$ is $qe_0$.
Step 7. Select the neuron $e$ with the highest $qe$ and its most dissimilar neighbor $d$ according to the expression (6), where $\Lambda_e$ is the set of neighbors of $e$.
$$d = \arg\max_i \lVert d(w_e, w_i) \rVert, \quad w_i \in \Lambda_e \qquad (6)$$
Step 8. Insert a row or column of neurons between $e$ and $d$, initializing their weight vectors as the means of their respective neighbors. Go to step 3.
Step 9. If $qe_i < \tau_2 \cdot qe_0$ for all neurons $i$ in the map, go to step 11.
Step 10. Select an unsatisfied neuron and expand it, creating a new map in the lower layer. The parent of the new map is the expanded neuron, and the weight vectors of the new map are initialized as the mean of the parent and its neighbors. Go to step 2.
Step 11. If there are remaining maps, select one and go to step 3. Otherwise, the algorithm ends.
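Steps 7 and 8 of the algorithm (finding the most dissimilar neighbor of the error neuron and inserting a row or column between them) can be sketched in Python as follows. The sketch makes simplifying assumptions that are not in the text: the map is a rectangular list of rows, only the numerical part of each weight vector is stored, dissimilarity is the plain Euclidean distance, and neighbors are the four directly adjacent units. All function names are hypothetical.

```python
import math


def most_dissimilar_neighbor(grid, e):
    """Return the (row, col) of the adjacent unit farthest from unit e.

    grid: list of rows, each row a list of numeric weight vectors.
    e: (row, col) position of the error neuron.
    """
    r, c = e
    neighbors = [(r + dr, c + dc)
                 for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                 if 0 <= r + dr < len(grid) and 0 <= c + dc < len(grid[0])]
    return max(neighbors,
               key=lambda n: math.dist(grid[r][c], grid[n[0]][n[1]]))


def mean(u, v):
    """Component-wise mean of two weight vectors."""
    return [(a + b) / 2.0 for a, b in zip(u, v)]


def insert_between(grid, e, d):
    """Return a new grid with a row or column inserted between e and d.

    New weights are the means of their two neighbors across the insertion.
    """
    if e[0] != d[0]:  # e and d are vertical neighbors: insert a row
        top = min(e[0], d[0])
        new_row = [mean(a, b) for a, b in zip(grid[top], grid[top + 1])]
        return grid[:top + 1] + [new_row] + grid[top + 1:]
    left = min(e[1], d[1])  # horizontal neighbors: insert a column
    return [row[:left + 1] + [mean(row[left], row[left + 1])] + row[left + 1:]
            for row in grid]


grid = [[[0.0], [1.0]],
        [[4.0], [5.0]]]
d = most_dissimilar_neighbor(grid, (0, 0))
grown = insert_between(grid, (0, 0), d)
```

Starting from the initial 2x2 map, each application of these two functions grows the map by one row or one column, after which training resumes at step 3.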
3 Experimental Results
Our IDS based on the new GHSOM model was trained with the pre-processed KDD Cup 1999 benchmark data set created by MIT Lincoln Laboratory. The purpose of this benchmark was to build a network intrusion detector capable of distinguishing between intrusions or attacks and normal connections. We have used the 10% KDD Cup 1999 benchmark data set, which contains 494021 connection records, each with 41 features. Here, a connection is a sequence of TCP packets that flows between a source and a target IP address. Since some features alone cannot constitute a sign of anomalous activity, it is better to analyze connection records rather than individual packets. Among the 41 features, three are symbolic: protocol type, service and status of the connection flag. The training data set contains 22 attack types in addition to normal records; the attacks fall into four main categories [10]: Denial of Service (DoS), Probe, Remote-to-Local (R2L) and User-to-Root (U2R).
In this paper, we have selected two data subsets for training the GHSOM from the total of 494021 connection records: SetA with 100000 connection records and SetB with 169000. Both SetA and SetB contain the 22 attack types. We tried to select the data in such a way that all the record types had the same distribution. However, the data in the 10% KDD Cup data set are irregularly distributed, and this irregularity was ultimately mirrored in our selection. The GHSOMs were trained on the two data subsets with 0.1 as the value for both parameters $\tau_1$ and $\tau_2$, since with these values we achieved good results and a very simple architecture. In fact, each trained GHSOM generated just two layers with 16 neurons, although with a different arrangement. The GHSOM trained with SetB is the one shown in Fig. 1.
Table 1. Training results for SetA and SetB
Training Set    Detected (%)    False Positive (%)    Identified (%)
SetA            99.98           3.03                  94.99
SetB            99.98           5.74                  95.09
Many related works are only interested in classifying the input pattern as one of two record types: anomalous or normal. Taking into account just two groups, normal records that are classified as anomalous are known as false positives, whereas anomalous records that are classified as normal are known as missed detections. However, we are also interested in classifying an anomalous record into its attack type, that is, taking into account 23 groups (22 attack types plus normal records) instead of 2 groups. Hence, we define the identification rate as the percentage of connection records that are correctly identified as their respective record types. The training results of the two GHSOMs obtained with the subsets SetA and SetB are shown in Table 1. Both subsets achieve a 99.98% attack detection rate, with false positive rates of 3.03% and 5.74%, respectively. Regarding the identification of the attack type, around 95% of the records were correctly identified in both cases.
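The three rates discussed above can be computed from true and predicted record types as in the following sketch. The denominators are assumptions, since the text does not state them explicitly: the detection rate is taken over attack records, the false positive rate over normal records, and the identification rate over all records. The function name and the label strings are hypothetical.

```python
def evaluate(true_labels, predicted_labels, normal="normal"):
    """Return (detected, false_positive, identified) as percentages."""
    pairs = list(zip(true_labels, predicted_labels))
    attacks = [(t, p) for t, p in pairs if t != normal]
    normals = [(t, p) for t, p in pairs if t == normal]
    # Detected: attack records classified as any attack type.
    detected = 100.0 * sum(p != normal for _, p in attacks) / len(attacks)
    # False positives: normal records classified as an attack.
    false_positive = 100.0 * sum(p != normal for _, p in normals) / len(normals)
    # Identified: records assigned exactly their own record type.
    identified = 100.0 * sum(t == p for t, p in pairs) / len(pairs)
    return detected, false_positive, identified


truth = ["normal", "normal", "smurf", "neptune"]
guess = ["normal", "smurf", "smurf", "smurf"]
rates = evaluate(truth, guess)
```

In this toy example both attacks are detected, but one of them is assigned the wrong attack type, which lowers the identification rate without affecting the detection rate.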
We have simulated the trained GHSOMs with the 494021 connection records from the benchmark data set. This simulation consists of classifying these data with the trained GHSOMs without any modification of the GHSOMs, that is, without a learning process. The simulation results of both GHSOMs are given in Table 2. Here, a 99.9% detection rate is achieved. Also, the identification rate rises to around 97% in both cases. The false positive rate increases for SetA during the simulation (from 3.03% to 5.41%), whereas for SetB it is lower than after training (5.11% versus 5.74%). Note that during both training and simulation, we used the 41 features and all the 22 existing attacks in the training data set. Moreover, the resulting number of neurons was small. In fact, there are fewer neurons than groups, although this is due to the scarce number of connection records from certain attack groups.
Table 2. Simulation results with 494021 records for SetA and SetB
Training Set    Detected (%)    False Positive (%)    Identified (%)
SetA            99.99           5.41                  97.09
SetB            99.91           5.11                  97.15
In Table 3, we compare the training results of SetA with the results obtained in [10] and [9], where SOMs were also used to build IDSs. In order to differentiate both IDSs based on self-organization, we call them SOM and K-Map, respectively, using the authors' notation, whereas our trained IDS is called GHSOM. From the first work, we chose the only SOM trained on all the 41 features, which was composed of 20x20 neurons. Another IDS implementation was proposed in the second work. Here, a hierarchical Kohonen net (K-Map), composed of three layers, was trained with a subset of 169000 connection records, taking into account the 41 features and 22 attack types. Their best result was a 99.63% detection rate after testing the K-Map, although with several restrictions. This result was achieved using a pre-selected combination of 20 features, which were divided into three levels, where each sub-combination of features was fed to a different layer. Also, they used just three attack types during the testing, whereas we used the 22 attack types. Moreover, the architecture of the hierarchical K-Map was established in advance, using 48 neurons in each layer, that is, 144 neurons, whereas we used just 16 neurons that were generated during the training process without any human intervention.
Table 3. Comparison results for different IDS implementations
IDS       Detected (%)    False Positive (%)
GHSOM     99.98           3.03
SOM       81.85           0.03
K-Map     99.63           0.34
4 Conclusions
This paper has presented a novel intrusion detection system based on growing hierarchical self-organizing maps (GHSOMs). These neural networks are composed of several SOMs arranged in layers, where the number of layers, maps and neurons are established during the training process, mirroring the inherent data structure. Moreover, we have taken into account the presence of symbolic features in addition to numerical features in the input data. In order to improve the classification of input data, we have introduced a new metric for GHSOMs based on entropy for symbolic data together with the Euclidean distance for numerical data.
We have used the 10% KDD Cup 1999 benchmark data set to train our IDS based on GHSOM. This data set contains 494021 connection records, in which 22 attack types in addition to normal records can be found. We trained and simulated two GHSOMs with two different subsets, SetA with 100000 connection records and SetB with 169000 connection records. Both SetA and SetB achieved a 99.98% detection rate, with false positive rates of 3.03% and 5.74%, respectively. The identification rate, that is, the percentage of connection records identified as their correct record types, was 94.99% and 95.09%, respectively. After the simulation of the two trained GHSOMs with the 494021 connection records, we achieved a 99.9% detection rate and false positive rates of 5.41% and 5.11% for SetA and SetB, respectively. The identification rate was around 97%.
Acknowledgements. This work is partially supported by the Spanish Ministry of
Education and Science under contract TIN-07362.
References
1. Denning, D.: An intrusion-detection model. IEEE Transactions on Software Engineering SE-13(2), 222–232 (1987)
2. Lee, W., Stolfo, S., Chan, P., Eskin, E., Fan, W., Miller, M., Hershkop, S., Zhang, J.: Real
time data mining-based intrusion detection. In: DARPA Information Survivability Conference & Exposition II, vol. 1, pp. 89–100 (2001)
3. Maxion, R., Tan, K.: Anomaly detection in embedded systems. IEEE Transactions on
Computers 51(2), 108–120 (2002)
4. Tan, K., Maxion, R.: Determining the operational limits of an anomaly-based intrusion detector. IEEE Journal on Selected Areas in Communications 21(1), 96–110 (2003)
5. Ying, H., Feng, T.J., Cao, J.K., Ding, X.Q., Zhou, Y.H.: Research on some problems in the
kohonen som algorithm. In: International Conference on Machine Learning and Cybernetics, vol. 3, pp. 1279–1282 (2002)
6. Kohonen, T.: Self-organized formation of topologically correct feature maps. Biological
cybernetics 43(1), 59–69 (1982)
7. Fritzke, B.: Growing grid - a self-organizing network with constant neighborhood range
and adaptation strength. Neural Processing Letters 2(5), 9–13 (1995)
8. Lee, W., Stolfo, S., Mok, K.: A data mining framework for building intrusion detection
models. In: IEEE Symposium on Security and Privacy, pp. 120–132 (1999)
9. Sarasamma, S., Zhu, Q., Hu, J.: Hierarchical kohonenen net for anomaly detection in network security. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 35(2), 302–312 (2005)
10. DeLooze, L., DeLooze, A.F.: Attack characterization and intrusion detection using an ensemble of self-organizing maps. In: 7th Annual IEEE Information Assurance Workshop,
pp. 108–115 (2006)