An Intrusion Detection System Based on Hierarchical Self-Organization

E.J. Palomo, E. Domínguez, R.M. Luque, and J. Muñoz

Department of Computer Science, E.T.S.I. Informatica, University of Malaga, Campus Teatinos s/n, 29071 – Malaga, Spain
{ejpalomo,enriqued,rmluque,munozp}@lcc.uma.es

Abstract. An intrusion detection system (IDS) monitors the IP packets flowing over the network to capture intrusions or anomalies. One of the techniques used for anomaly detection is building statistical models using metrics derived from observation of the user's actions. A neural network model based on self-organization is proposed for detecting intrusions. The self-organizing map (SOM) has proven successful for the analysis of high-dimensional input data in data mining applications such as network security. The proposed growing hierarchical SOM (GHSOM) addresses the limitations of the SOM related to its static architecture. The GHSOM is an artificial neural network model with a hierarchical architecture composed of independent growing SOMs. Randomly selected subsets that contain both attacks and normal records from the KDD Cup 1999 benchmark are used for training the proposed GHSOM.

Keywords: Network security, self-organization, intrusion detection.

1 Introduction

Network communications are becoming more and more important to the information society. Business, e-commerce and other network transactions require more secure networks. As these operations increase, computer crimes and attacks become more frequent and dangerous, compromising the security and trust of computer systems and causing costly financial losses. In order to detect and prevent these attacks, intrusion detection systems have been used and have become an important area of research over the years. An intrusion detection system (IDS) monitors the network traffic to detect intrusions or anomalies. There are two different approaches used to detect intrusions [1].
The first approach is known as misuse detection, which compares network activity against previously stored signatures of known attacks. This method is good at detecting many or all known attacks and achieves a high detection rate. However, it fails to detect occurrences of unknown attacks, and the signature database has to be updated manually. The second approach is known as anomaly detection. These methods first establish a normal activity profile; variations from this normal activity are then considered anomalous. Anomaly-based systems assume that anomalous activities are intrusion attempts. However, many of these anomalous activities are in fact normal activities, which produces false positives. Many anomaly detection systems build statistical models using metrics derived from the user's actions [1]. Anomaly detection systems using data mining techniques such as clustering, support vector machines (SVM) and neural networks have also been proposed [2-4].

E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 139–146, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009

Several IDSs based on self-organizing maps have been proposed; however, they have difficulty detecting a wide variety of attacks with low false positive rates [5]. The self-organizing map (SOM) [6] has been applied successfully in multiple areas. However, the network architecture of a SOM has to be established in advance, which requires knowledge about the problem domain. Moreover, the hierarchical relations among input data are difficult to represent. The growing hierarchical SOM (GHSOM) addresses these problems. The GHSOM is an artificial neural network which consists of several growing SOMs [7] arranged in layers. The number of layers, maps and neurons per map are automatically adapted and established during the training process of the GHSOM, according to the set of input patterns fed to the neural network.
Applied to an intrusion detection system, the input patterns are samples of network traffic, which can be classified as anomalous or normal activities. The data analyzed from network traffic can be numerical (e.g. the number of seconds of the connection) or symbolic (e.g. the protocol type used). The Euclidean distance is the metric usually used to compare two input data. However, it is not suitable for symbolic data. Therefore, this paper proposes an alternative metric for GHSOMs in which both symbolic and numerical data are considered. The IDS implemented with the GHSOM was trained with the KDD Cup 1999 benchmark data set [8]. This data set has served as the first and only reliable benchmark data set that has been used for most of the research work on intrusion detection algorithms [9]. It includes a wide variety of simulated attacks.

The remainder of this paper is organized as follows. Section 2 discusses the new GHSOM model used to build the IDS and describes its training algorithm. In Section 3, we show some experimental results obtained by comparing our implementation of the IDS with other related works in which data from the KDD Cup 1999 benchmark are used. Section 4 concludes this paper.

2 GHSOM for Intrusion Detection

The implemented IDS is based on the GHSOM. Initially, the GHSOM consists of a single SOM of 2x2 neurons. Then, the GHSOM adapts its architecture depending on the input patterns, so that the GHSOM structure mirrors the structure of the input data and yields a good data representation. This neural network structure is used to classify the input data into groups, where each neuron represents a data group with similar features. The level of data representation of each neuron is measured as the quantization error of the neuron ($qe$). The $qe$ measures how similar the data mapped onto a neuron are: the higher the $qe$, the higher the heterogeneity of the data.
Usually the $qe$ has been expressed in terms of the Euclidean distance between the input pattern and the weight vector of a neuron. However, many real-life problems involve symbolic features as well as numerical ones. To take an obvious example, among the features analyzed to build an IDS, three are symbolic: protocol type (e.g. TCP, UDP and ICMP), service (e.g. HTTP, SMTP, FTP, TELNET, etc.) and the status flag of the connection (e.g. SF, S1, REJ, etc.). Unlike numerical data, symbolic data have no associated order and cannot be measured by a distance. It makes no sense to use the Euclidean distance between two symbolic values, for example the distance between HTTP and FTP. It seems better to use a similarity measure rather than a distance measure for symbolic data. For that reason, in this paper we introduce the entropy as the measure of the representation error of a neuron for symbolic data, together with the Euclidean distance for numerical data.

Fig. 1. Sample architecture of a GHSOM

Let $x_j = [x_j^n, x_j^s]$ be the $j$th input pattern, where $x_j^n$ is the vector component of numerical features and $x_j^s$ is the component of symbolic features. The error of a unit $i$ ($e_i$) in the representation is defined as follows:

$$e_i = [e_i^n, e_i^s] = \left[ \sum_{x_j \in C_i} \| w_i^n - x_j^n \| ,\; -\sum_{x_j \in C_i} p(x_j^s) \log p(x_j^s) \right] \qquad (1)$$

where $e_i^n$ and $e_i^s$ are the error components of numerical and symbolic features, respectively, $w_i^n$ is the numerical part of the weight vector of the unit $i$, $C_i$ is the set of patterns mapped onto the unit $i$, and $p(x_j^s)$ is the probability of the element $x_j^s$ in $C_i$. The quantization error of the unit $i$ is given by expression (2).

$$qe_i = |e_i| \qquad (2)$$

First of all, the quantization error of the layer 0 map ($qe_0$) has to be computed as shown above. In this case, the error $e_0$ is computed as specified in (1), but using $w_0$, the mean of all the input data, instead of $w_i$, and the set $I$ of all input patterns instead of the set $C_i$.

Each map is trained in the same way as a single SOM [6].
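As an illustration of expressions (1) and (2), the following Python sketch computes the two error components of one unit and combines them into a quantization error. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `unit_error` and `quantization_error` are hypothetical, the symbolic term is taken as the Shannon entropy over the distinct symbolic tuples mapped onto the unit, the logarithm is base 2 (the text does not fix the base), and the bars in (2) are read as the Euclidean norm of the two-component error.

```python
import math
from collections import Counter


def unit_error(weight_num, mapped):
    """Return (e_n, e_s) for one unit, in the spirit of expression (1).

    weight_num: numerical part of the unit's weight vector.
    mapped: list of (numeric_vector, symbolic_tuple) patterns mapped onto the unit.
    """
    # Numerical component: sum of Euclidean distances to the weight vector.
    e_n = sum(math.dist(weight_num, x_num) for x_num, _ in mapped)
    # Symbolic component (assumption): Shannon entropy, base 2, of the
    # distinct symbolic tuples observed in the unit.
    counts = Counter(x_sym for _, x_sym in mapped)
    total = len(mapped)
    e_s = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return e_n, e_s


def quantization_error(weight_num, mapped):
    """Combine both components as in (2); the Euclidean norm is an assumption."""
    e_n, e_s = unit_error(weight_num, mapped)
    return math.hypot(e_n, e_s)


# Toy usage: two patterns with different symbolic parts mapped onto one unit.
patterns = [
    ((0.0, 0.0), ("tcp", "http", "SF")),
    ((3.0, 4.0), ("udp", "domain_u", "SF")),
]
e_n, e_s = unit_error((0.0, 0.0), patterns)
qe = quantization_error((0.0, 0.0), patterns)
```

With this reading, a unit whose patterns all share the same symbolic values has a symbolic error of zero, so only the numerical spread contributes to its quantization error.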
An input pattern is randomly selected and each neuron determines its activation according to a similarity measure. In our case, since both symbolic and numerical data are present, the neuron with the smallest value of the similarity measure defined in (3) becomes the winner.

$$d(x_j, w_i) = [d^n, d^s] = \left[ \| x_j^n - w_i^n \| ,\; -\log p(x_j^s, w_i^s) \right] \qquad (3)$$

For the numerical component, $d^n$ is the Euclidean distance between the two vectors. For symbolic data, the measure checks whether the two vectors are the same or not; that is, the probability $p$ can only take the value 1 if the vectors are the same, or 0.5 if they are different. Therefore, for symbolic data the value of $d^s$ will be 0 if $x_j^s$ and $w_i^s$ are equal, and $-\log 0.5 = \log 2$ if they are not the same. Taking the new similarity measure into account, the index $r$ of the winner neuron is defined in (4).

$$r = \arg\min_i |d(x_j, w_i)| \qquad (4)$$

The weight vector of a neuron $i$ is adapted according to expression (5). For the numerical component, the winner and its neighbors are adapted, with an amount of adaptation that follows a Gaussian neighborhood function $h_i$ and, as in the standard SOM [6], a learning rate $\alpha$. For symbolic data, only the winner is adapted, taking as its weight vector the mode of the set of input patterns mapped onto the winner, $C_r$.

$$w_i^n(t+1) = w_i^n(t) + \alpha(t)\, h_i(t)\, [x_j^n(t) - w_i^n(t)], \qquad w_r^s(t+1) = \mathrm{mode}(C_r) \qquad (5)$$

The growth and expansion of the GHSOM are controlled by two parameters: $\tau_1$, which controls the growth of a single map, and $\tau_2$, which controls the expansion of each neuron of the GHSOM. Specifically, a map $m$ stops growing when the mean of the map's quantization errors ($MQE_m$) reaches a certain fraction $\tau_1$ of the $qe_u$ of the corresponding neuron $u$ that was expanded into the map $m$. Also, a neuron $i$ is not expanded if its quantization error ($qe_i$) is smaller than a certain fraction $\tau_2$ of $qe_0$. Thus, the smaller the parameter $\tau_2$ is chosen, the deeper the hierarchy will be, because fewer neurons satisfy this criterion. Likewise, small $\tau_1$ values lead to large maps. In this way, these two parameters provide control over the resulting hierarchical architecture. Note that these parameters are the only ones that have to be established in advance.
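The mixed similarity measure, the winner selection and the two stopping criteria described above can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: all function names and the parameter names `tau1` and `tau2` (standing for the map-growth and neuron-expansion parameters) are hypothetical, the logarithm is assumed to be base 2 so that a symbolic mismatch costs exactly 1, the two components are combined with the Euclidean norm, and a strict "smaller than" comparison is assumed for both criteria.

```python
import math


def dissimilarity(x, w):
    """Two-component measure in the spirit of expression (3): (d_n, d_s).

    x and w are (numeric_vector, symbolic_tuple) pairs.
    """
    x_num, x_sym = x
    w_num, w_sym = w
    d_n = math.dist(x_num, w_num)
    # p is 1 when the symbolic parts coincide and 0.5 otherwise.
    p = 1.0 if x_sym == w_sym else 0.5
    d_s = -math.log2(p)  # base 2 is an assumption
    return d_n, d_s


def winner(x, weights):
    """Index of the neuron with the smallest combined measure, as in (4)."""
    return min(
        range(len(weights)),
        key=lambda i: math.hypot(*dissimilarity(x, weights[i])),
    )


def map_stops_growing(mqe_map, qe_parent, tau1):
    """A map stops growing once its mean error falls below tau1 * qe_parent."""
    return mqe_map < tau1 * qe_parent


def neuron_needs_expansion(qe_unit, qe0, tau2):
    """A neuron is expanded unless its error is below tau2 * qe0."""
    return qe_unit >= tau2 * qe0


# Toy usage with two neurons that differ in their symbolic part.
weights = [
    ((0.0, 0.0), ("tcp", "http", "SF")),
    ((1.0, 1.0), ("udp", "domain_u", "SF")),
]
near_second = winner(((0.9, 0.9), ("tcp", "http", "SF")), weights)
between = winner(((0.6, 0.6), ("tcp", "http", "SF")), weights)
```

In the toy usage, the first pattern is numerically so close to the second neuron that it wins despite the symbolic mismatch, whereas the second pattern is assigned to the first neuron because the mismatch penalty outweighs the small numerical advantage.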
The pseudocode of the training algorithm of the proposed GHSOM is defined as follows.

Step 1. Compute the mean of all the input data, $w_0$, and then the initial quantization error $qe_0$.
Step 2. Create an initial map with 2x2 neurons.
Step 3. Train the map for a given number of iterations as a single SOM, using expressions (3), (4) and (5).
Step 4. Compute the quantization error $qe_i$ of each neuron according to expression (2).
Step 5. Calculate the mean of all units' quantization errors of the map ($MQE_m$).
Step 6. If $MQE_m < \tau_1 \cdot qe_u$, go to step 9, where $qe_u$ is the $qe$ of the corresponding unit $u$ in the upper layer that was expanded. For the first layer, this is $qe_0$.
Step 7. Select the neuron $e$ with the highest $qe$ and its most dissimilar neighbor $k$ according to expression (6), where $\Lambda_e$ is the set of neighbors of $e$.

$$k = \arg\max_i |d(w_e, w_i)|, \quad w_i \in \Lambda_e \qquad (6)$$

Step 8. Insert a row or column of neurons between $e$ and $k$, initializing their weight vectors as the means of their respective neighbors. Go to step 3.
Step 9. If $qe_i < \tau_2 \cdot qe_0$ for all neurons $i$ in the map, go to step 11.
Step 10. Select an unsatisfied neuron and expand it, creating a new map in the lower layer. The parent of the new map is the expanded neuron, and the weight vectors of the new map are initialized as the mean of the parent and its neighbors. Go to step 2.
Step 11. If there are remaining maps, select one and go to step 3. Otherwise, the algorithm ends.

3 Experimental Results

Our IDS based on the new GHSOM model was trained with the pre-processed KDD Cup 1999 benchmark data set created by MIT Lincoln Laboratory. The purpose of this benchmark was to build a network intrusion detector capable of distinguishing between intrusions or attacks and normal connections. We have used the 10% KDD Cup 1999 benchmark data set, which contains 494021 connection records, each of them with 41 features. Here, a connection is a sequence of TCP packets that flows between a source and a target IP address.
Since some features alone cannot constitute a sign of anomalous activity, it is better to analyze connection records rather than individual packets. Among the 41 features, three are symbolic: protocol type, service and the status flag of the connection. The training data set contains 22 attack types in addition to normal records; the attacks fall into four main categories [10]: Denial of Service (DoS), Probe, Remote-to-Local (R2L) and User-to-Root (U2R).

In this paper, we have selected two data subsets for training the GHSOM from the total of 494021 connection records: SetA with 100000 connection records and SetB with 169000. Both SetA and SetB contain the 22 attack types. We tried to select the data so that all record types were equally represented. However, the record types in the 10% KDD Cup data set are unevenly distributed, and this imbalance was ultimately mirrored in our selection. The GHSOMs were trained on the two subsets with 0.1 as the value of both parameters $\tau_1$ and $\tau_2$, since with these values we achieved good results and a very simple architecture. In fact, each trained GHSOM generated just two layers with 16 neurons, although with a different arrangement. The GHSOM trained with SetB is the one shown in Fig. 1.

Table 1. Training results for SetA and SetB

Training Set   Detected (%)   False Positive (%)   Identified (%)
SetA           99.98          3.03                 94.99
SetB           99.98          5.74                 95.09

Many related works are only interested in classifying an input pattern as one of two record types: anomalous or normal. When only these two groups are considered, normal records that are classified as anomalous are known as false positives, whereas anomalous records that are classified as normal are known as missed detections. However, we are also interested in classifying an anomalous record into its attack type, that is, in taking into account 23 groups (22 attack types plus normal records) instead of 2 groups.
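The two-group and 23-group views can be made concrete with a short Python sketch that derives the three rates reported in the tables from true and predicted labels. The definitions used here are assumptions made for illustration, since the text does not give formulas: the detection rate is the fraction of attack records labelled as any attack, the false positive rate is the fraction of normal records labelled as an attack, and the identification rate is the fraction of all records that receive exactly their own record type. The function name `rates` is hypothetical and the labels in the usage example are toy data, not results from the benchmark.

```python
def rates(true_labels, predicted_labels, normal="normal"):
    """Return (detection, false_positive, identification) as fractions.

    Assumes both lists have the same length and contain at least one
    attack record and one normal record.
    """
    pairs = list(zip(true_labels, predicted_labels))
    attacks = [(t, p) for t, p in pairs if t != normal]
    normals = [(t, p) for t, p in pairs if t == normal]
    # Two-group view: any attack label counts as "anomalous".
    detection = sum(p != normal for _, p in attacks) / len(attacks)
    false_positive = sum(p != normal for _, p in normals) / len(normals)
    # Multi-group view: the exact record type must match.
    identification = sum(t == p for t, p in pairs) / len(pairs)
    return detection, false_positive, identification


# Toy usage with two attack types and normal records.
true = ["normal", "normal", "smurf", "neptune", "smurf"]
pred = ["normal", "smurf", "smurf", "smurf", "normal"]
detection, false_positive, identification = rates(true, pred)
```

In this toy example the `neptune` record counts as detected, because it is labelled as an attack, but not as identified, because it is labelled with the wrong attack type.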
Hence, we call identification rate the proportion of connection records that are correctly identified as their respective record types. The training results of the two GHSOMs obtained with the subsets SetA and SetB are shown in Table 1. Both subsets achieve a 99.98% attack detection rate, with false positive rates of 3.03% and 5.74%, respectively. Regarding the identification of the attack type, around 95% of the records were correctly identified in both cases.

We have simulated the trained GHSOMs with the 494021 connection records from the benchmark data set. This simulation consists of classifying these data with the trained GHSOMs without any modification of the GHSOMs, that is, without a learning process. The simulation results of both GHSOMs are given in Table 2. Here, a detection rate above 99.9% is achieved. Also, the identification rate rises to about 97% in both cases. The false positive rate increases for SetA during the simulation (from 3.03% to 5.41%), whereas for SetB it is lower than after training (5.11% versus 5.74%). Note that during both training and simulation, we used the 41 features and all the 22 attack types present in the training data set. Moreover, the resulting number of neurons was very small. In fact, there are fewer neurons (16) than groups (23), which is due to the scarcity of connection records from certain attack groups.

Table 2. Simulation results with 494021 records for SetA and SetB

Training Set   Detected (%)   False Positive (%)   Identified (%)
SetA           99.99          5.41                 97.09
SetB           99.91          5.11                 97.15

In Table 3, we compare the training results of SetA with the results obtained in [10] and [9], where SOMs were used to build IDSs as well. In order to differentiate these two IDSs based on self-organization, we call them SOM and K-Map, respectively, using the authors' notation, whereas our trained IDS is called GHSOM. From the first work, we chose the only SOM trained on all the 41 features, which was composed of 20x20 neurons. Another IDS implementation was proposed in the second work.
Here, a hierarchical Kohonen net (K-Map), composed of three layers, was trained with a subset of 169000 connection records, taking into account the 41 features and 22 attack types. Their best result was a 99.63% detection rate after testing the K-Map, although with several restrictions. This result was achieved using a pre-selected combination of 20 features, which were divided into three levels, where each sub-combination of features was fed to a different layer. Also, they used just three attack types during testing, whereas we used the 22 attack types. Moreover, the architecture of the hierarchical K-Map was established in advance, using 48 neurons in each layer, that is, 144 neurons, whereas we used just 16 neurons, which were generated during the training process without any human intervention.

Table 3. Comparison results for different IDS implementations

         Detected (%)   False Positive (%)
GHSOM    99.98          3.03
SOM      81.85          0.03
K-Map    99.63          0.34

4 Conclusions

This paper has presented a novel intrusion detection system based on growing hierarchical self-organizing maps (GHSOMs). These neural networks are composed of several SOMs arranged in layers, where the number of layers, maps and neurons is established during the training process, mirroring the inherent data structure. Moreover, we have taken into account the presence of symbolic features in addition to numerical features in the input data. In order to improve the classification of input data, we have introduced a new metric for GHSOMs based on entropy for symbolic data together with the Euclidean distance for numerical data. To train our IDS based on the GHSOM, we have used the 10% KDD Cup 1999 benchmark data set, which contains 494021 connection records and includes 22 attack types in addition to normal records.
We trained and simulated two GHSOMs with two different subsets, SetA with 100000 connection records and SetB with 169000 connection records. Both SetA and SetB achieved a 99.98% detection rate, with false positive rates of 3.03% and 5.74%, respectively. The identification rate, that is, the proportion of connection records identified as their correct record types, was 94.99% and 95.09%, respectively. After simulating the two trained GHSOMs with the 494021 connection records, we achieved a detection rate above 99.9% and false positive rates of 5.41% for SetA and 5.11% for SetB. The identification rate was around 97%.

Acknowledgements. This work is partially supported by the Spanish Ministry of Education and Science under contract TIN-07362.

References

1. Denning, D.: An intrusion-detection model. IEEE Transactions on Software Engineering SE-13(2), 222–232 (1987)
2. Lee, W., Stolfo, S., Chan, P., Eskin, E., Fan, W., Miller, M., Hershkop, S., Zhang, J.: Real time data mining-based intrusion detection. In: DARPA Information Survivability Conference & Exposition II, vol. 1, pp. 89–100 (2001)
3. Maxion, R., Tan, K.: Anomaly detection in embedded systems. IEEE Transactions on Computers 51(2), 108–120 (2002)
4. Tan, K., Maxion, R.: Determining the operational limits of an anomaly-based intrusion detector. IEEE Journal on Selected Areas in Communications 21(1), 96–110 (2003)
5. Ying, H., Feng, T.J., Cao, J.K., Ding, X.Q., Zhou, Y.H.: Research on some problems in the Kohonen SOM algorithm. In: International Conference on Machine Learning and Cybernetics, vol. 3, pp. 1279–1282 (2002)
6. Kohonen, T.: Self-organized formation of topologically correct feature maps. Biological Cybernetics 43(1), 59–69 (1982)
7. Fritzke, B.: Growing grid - a self-organizing network with constant neighborhood range and adaptation strength. Neural Processing Letters 2(5), 9–13 (1995)
8. Lee, W., Stolfo, S., Mok, K.: A data mining framework for building intrusion detection models. In: IEEE Symposium on Security and Privacy, pp. 120–132 (1999)
9. Sarasamma, S., Zhu, Q., Hu, J.: Hierarchical Kohonen net for anomaly detection in network security. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 35(2), 302–312 (2005)
10. DeLooze, L., DeLooze, A.F.: Attack characterization and intrusion detection using an ensemble of self-organizing maps. In: 7th Annual IEEE Information Assurance Workshop, pp. 108–115 (2006)