Evaluating Sequential Combination of Two Genetic Algorithm-Based Solutions for Intrusion Detection

Zorana Banković, Slobodan Bojanić, and Octavio Nieto-Taladriz

ETSI Telecomunicación, Universidad Politécnica de Madrid, Ciudad Universitaria s/n, 28040 Madrid, Spain
{zorana,slobodan,nieto}@die.upm.es

Abstract. The paper presents a serial combination of two genetic algorithm-based intrusion detection systems. Feature extraction techniques are deployed to reduce the amount of data that the system needs to process. The designed system is simple enough not to introduce significant computational overhead, yet it is accurate, adaptive and fast. Many existing solutions are based on machine learning techniques, but most of them introduce high computational overhead. Moreover, due to its inherent parallelism, our solution can be implemented using reconfigurable hardware at a cost much lower than that of traditional systems. The model is verified on the KDD99 benchmark dataset, yielding a solution competitive with the state of the art.

Keywords: intrusion detection, genetic algorithm, sequential combination, principal component analysis, multi expression programming.

1 Introduction

Computer networks are usually protected against attacks by a number of access restriction policies (anti-virus software, firewalls, message encryption, secured network protocols, password protection). Since these solutions have proven insufficient for providing a high level of network security, additional support is needed for detecting malicious traffic. Intrusion detection systems (IDS) are placed inside the protected network, looking for potential threats in network traffic and/or audit data recorded by hosts. IDS have three common problems that should be tackled when designing such a system: speed, accuracy and adaptability.
The speed problem arises from the large amount of data that needs to be monitored in order to perceive the entire situation. An existing approach to solving this problem is to split the network stream into more manageable streams and analyze each in real time using a separate IDS. The event stream must be split in a way that covers all relevant attack scenarios, which assumes that all attack scenarios are known a priori. We take a different approach. Instead of defining different attack scenarios, we extract the features of network traffic that are likely to take part in an attack. This provides higher flexibility, since a feature can be relevant for more than one attack or may be abused by an unknown attack. Moreover, we need only one IDS to perform the detection. Finally, the total amount of data to be processed is greatly reduced in this way. Hence, the time spent on offline training of the system and, afterwards, on attack detection is also reduced.

E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 147–154, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009

Incorporation of learning algorithms provides a potential solution for the adaptation and accuracy issues. A great number of machine learning techniques have been deployed for intrusion detection, both in commercial systems and in the state of the art. These techniques can introduce a certain amount of 'intelligence' into the process of detecting attacks while processing large amounts of data much faster than a human. However, these systems can introduce high computational overhead. Furthermore, many of them do not deal properly with so-called 'rare' classes [5], i.e. classes that have a significantly smaller number of elements than the rest of the classes. This problem occurs mostly due to the tendency towards generalization that most of these techniques exhibit.
Intrusions can be considered rare classes, since the amount of intrusive traffic is considerably smaller than the amount of normal traffic. Thus, we need a machine learning technique that is capable of dealing with this issue.

In this work we present a genetic algorithm (GA) [9] approach for classifying network connections. GAs are robust, inherently parallel, adaptable and suitable for the classification of rare classes [5]. Moreover, due to this inherent parallelism, the system can be implemented using reconfigurable devices without the need to deploy a microprocessor. In this way, the implementation cost would be much lower than that of a traditional IDS, while offering a higher level of adaptability.

This work is a continuation of our previous one [1], where we investigated the possibilities of applying GA to intrusion detection while deploying a small subset of features. The experiments confirmed the robustness of GA and inspired us to continue experimenting on the subject. Here we investigate a combination of two GA-based intrusion detection systems. The first system in the chain is an anomaly-based IDS implemented as a simple linear classifier. This system exhibits a high false-positive rate. Thus, we have added a simple system based on if-then rules that filters the decision of the linear classifier and in that way significantly reduces the false-positive rate. In effect, we create a strong classifier built upon weak classifiers, but without the need to follow the process of a boosting algorithm [15], as both of the created systems can be trained separately.

The KDD99Cup training and testing dataset was used for evolving our GA-based system [6]. The KDD99Cup dataset was found to have quite a few drawbacks [7], [8], but it is still the prevailing dataset used for training and testing of IDS due to its good structure and availability [3], [4].
Because of these shortcomings, the presented results do not illustrate the behavior of the system in a real-world environment, but they do reflect its potential. In the general case, the performance of the implemented system depends heavily on the training data.

In the following text, Section 2 gives a survey and comparison of machine learning techniques deployed for intrusion detection. Section 3 details the implementation of the system. Section 4 presents the benchmark KDD99 training and testing dataset, evaluates the performance of the system using this dataset and discusses the results. Finally, the conclusions are drawn in Section 5.

2 Machine Learning Techniques for Intrusion Detection – Survey and Comparison

In the recent past there has been growing recognition of intelligent techniques for the construction of efficient and reliable intrusion detection systems. A complete survey of these techniques is hard to present at this point, since there are more than a hundred IDS based on machine learning techniques. Some of the best-performing techniques in the state of the art apply GA [4], a combination of neural networks and C4.5 [11], a genetic programming (GP) ensemble [12], support vector machines [13] or fuzzy logic [14].

All of the named techniques have two steps: training and testing. The systems have to be constantly retrained using new data, since new attacks are emerging every day. The advantage of all GA- or GP-based techniques lies in their easy retraining. It is enough to use the best population evolved in the previous iteration as the initial population and repeat the process, this time including the new data. Thus, our system is inherently adaptive, which is an imperative quality of an IDS.

Furthermore, GAs are intrinsically parallel: since they have multiple offspring, they can explore the solution space in multiple directions at once.
Due to the parallelism that allows them to implicitly evaluate many schemas at once, GAs are particularly well suited to solving problems where the space of all potential solutions is too vast to search exhaustively in any reasonable amount of time, as is the case with network data.

GA-based techniques are appropriate for dealing with rare classes. As they work with populations of candidate solutions rather than a single solution and employ stochastic operators to guide the search process, GAs cope well with attribute interactions and avoid getting stuck in local maxima, which together make them very suitable for classifying rare classes [5]. We have gone further by deploying the standard F-measure as the fitness function. The F-value has proven to be very suitable when dealing with rare classes [5]. The F-measure is a combination of precision and recall. Rare cases and classes are valued when using these metrics because both precision and recall are defined with respect to a rare class. None of the GA or GP techniques stated above considers the problem of rare classes.

A technique that considers the problem of rare classes is given in [15]. That solution is similar to ours in the sense that it deploys a boosting technique, which also creates a strong classifier by combining several weak classifiers. Furthermore, its results are presented in terms of the F-measure. The advantage of our system is that we can train its parts independently, while a boosting algorithm trains its parts one after another.

Finally, if we want to consider the possibility of implementation using reconfigurable hardware, not all systems are appropriate, due to their sequential nature or high computational complexity. Due to the parallelism of our algorithm, a hardware implementation using reconfigurable devices is possible.
This can lead to lower implementation cost with a higher level of adaptability compared to the existing solutions, and to reduced time for system training and testing.

In short, the main advantage of our solution lies in the fact that it combines important characteristics (high accuracy and performance, dealing with rare classes, inherent adaptability, feasibility of hardware implementation) in one solution. We are not familiar with any existing solution that covers all the characteristics mentioned above.

3 System Implementation

The implemented IDS is a serial combination of two IDSs. The complete system is presented in Fig. 1. The first part is a linear classifier that classifies connections into normal ones and potential attacks. Due to its very low false-negative rate, the decision that it makes on normal connections is considered correct. However, as it exhibits a high false-positive rate, a decision in favor of an attack has to be re-checked. This re-checking is performed by a rule-based system whose rules are trained to recognize normal connections. This part of the system exhibits a very low false-positive rate, i.e. the probability of an attack being incorrectly classified as a normal connection is very low. In this way, the false-positive rate of the entire system is significantly reduced while a high detection rate is maintained.

As our system is trained and tested on the KDD99 dataset, the selection of the most important features is performed once, at the beginning of the process. An implementation for a real-world environment, however, would require performing the feature selection process before each round of training.

Fig. 1. Block Diagram of the Complete System

The linear classifier is based on a linear combination of three features. The features are identified, by deploying PCA [1], as those that have the highest likelihood of taking part in an attack. The details of the PCA algorithm are explained in [16].
The selected features and their explanations are presented in Table 1.

Table 1. The features used to describe the attacks

Name of the feature         Explanation
duration                    length (number of seconds) of the connection
src_bytes                   number of data bytes from source to destination
dst_host_srv_serror_rate    percentage of connections that have "SYN" errors

The linear classifier is evolved using a GA [9]. Each chromosome in the population, i.e. each potential solution to the problem, consists of four genes, where the first three represent the coefficients of the linear classifier and the fourth one the threshold value. The decision about the current connection is made according to formula (1):

gene(1)*con(duration) + gene(2)*con(src_bytes) + gene(3)*con(dst_host_srv_serror_rate) < gene(4)    (1)

where con(duration), con(src_bytes) and con(dst_host_srv_serror_rate) are the values of the duration, src_bytes and dst_host_srv_serror_rate features of the current connection.

The linear classifier is trained using an incremental GA [9]. The population contained 1000 individuals, which were evolved over 300 generations. The mutation rate was 0.1, while the crossover rate was 0.9. These values were chosen after a number of experiments. Increasing the size of the population and the number of generations was stopped when it no longer brought a significant performance improvement. The type of crossover deployed was uniform crossover, i.e. each gene of a new individual had an equal chance of coming from either of its parents. The performance measure, i.e. the fitness function, was the squared fraction of correctly classified connections, according to formula (2):

$$\mathit{fitness} = \left( \frac{\mathit{count}}{\mathit{numOfCon}} \right)^{2} \qquad (2)$$

where count is the number of correctly classified connections, while numOfCon is the number of connections in the training dataset.
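Formulas (1) and (2), together with the uniform crossover described above, can be sketched in a few lines. This is a minimal illustration and not the authors' GAlib-based C++ implementation. The names `inequality_holds`, `fitness` and `uniform_crossover` are hypothetical, and the text does not state which class satisfies inequality (1), so the sketch simply assumes that each label records whether the inequality is expected to hold for that connection.

```python
import random
from typing import List, Sequence, Tuple

# Assumed feature order: (duration, src_bytes, dst_host_srv_serror_rate).
Connection = Tuple[float, float, float]
# Three coefficients followed by the threshold, as in the 4-gene chromosome.
Chromosome = Sequence[float]


def inequality_holds(genes: Chromosome, con: Connection) -> bool:
    """Evaluate formula (1) for a single connection."""
    weighted_sum = sum(g * x for g, x in zip(genes[:3], con))
    return weighted_sum < genes[3]


def fitness(genes: Chromosome,
            connections: Sequence[Connection],
            labels: Sequence[bool]) -> float:
    """Formula (2): squared fraction of correctly classified connections.

    Assumption: labels[i] is True when inequality (1) should hold for
    connections[i]; the source does not fix this convention.
    """
    count = sum(1 for con, label in zip(connections, labels)
                if inequality_holds(genes, con) == label)
    return (count / len(connections)) ** 2


def uniform_crossover(parent_a: Chromosome,
                      parent_b: Chromosome,
                      rng: random.Random) -> List[float]:
    """Each gene of the child comes from either parent with equal chance."""
    return [a if rng.random() < 0.5 else b
            for a, b in zip(parent_a, parent_b)]
```

With three of four connections classified correctly, `fitness` returns (3/4)^2 = 0.5625. Squaring leaves the ranking of individuals unchanged but widens the gap between good and very good individuals.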
The squared value was chosen rather than the simple percentage because the achieved results were better. The result of this GA was its best individual, which forms the first part of the system presented in Fig. 1.

The second part of the system (Fig. 1) is a rule-based system, where simple if-then rules for recognizing normal connections are evolved. The most important features were taken from the results obtained in [2] using Multi Expression Programming (MEP). The features and their explanations are listed in Table 2.

Table 2. The features used to describe normal connections

Name of the feature    Explanation
service                destination service (e.g. telnet, ftp)
hot                    number of hot indicators
logged_in              1 if successfully logged in; 0 otherwise

An example of a rule is the following:

if (service="http" and hot="0" and logged_in="0") then normal;

The rules are trained using an incremental GA with the same parameters as those used for the linear classifier. Each 3-gene chromosome represents a rule, where the value of each gene is the value of its corresponding feature. However, the population used in this case contained 500 individuals, as no improvements were achieved with larger populations. The result of the training was a set of the 200 best-performing rules. The fitness function in this case was the F-value with the parameter 0.8:

$$\mathit{fitness} = \frac{1.8 \cdot \mathit{recall} \cdot \mathit{precision}}{0.8 \cdot \mathit{precision} + \mathit{recall}}, \qquad \mathit{precision} = \frac{TP}{TP + FP}, \qquad \mathit{recall} = \frac{TP}{TP + FN} \qquad (3)$$

where TP, FP and FN stand for the number of true positives, false positives and false negatives, respectively.

The system presented here was implemented in the C++ programming language. The software for this work used the GAlib genetic algorithm package, written by Matthew Wall at the Massachusetts Institute of Technology [10]. Training the implemented system takes 185 seconds, while the testing process takes 45 seconds.
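The rule representation and the fitness of formula (3) can be illustrated as follows. This is a sketch under stated assumptions, not the authors' code: the names `rule_matches`, `f_value` and `rule_fitness` are hypothetical, a positive is assumed to be a connection that the rule labels as normal, and returning 0.0 when there are no true positives is a choice made here to avoid division by zero.

```python
from typing import Iterable, Tuple

Rule = Tuple[str, int, int]           # (service, hot, logged_in)
Record = Tuple[str, int, int, bool]   # the three features plus is_normal


def rule_matches(rule: Rule, record: Record) -> bool:
    """A rule fires when all three feature values are equal."""
    return record[:3] == rule


def f_value(tp: int, fp: int, fn: int, param: float = 0.8) -> float:
    """Formula (3); with param = 0.8 the numerator factor is 1.8."""
    if tp == 0:
        return 0.0  # assumption: no true positives means zero fitness
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + param) * recall * precision / (param * precision + recall)


def rule_fitness(rule: Rule, records: Iterable[Record]) -> float:
    """Count TP, FP and FN for one rule and score it with formula (3)."""
    tp = fp = fn = 0
    for record in records:
        predicted_normal = rule_matches(rule, record)
        is_normal = record[3]
        if predicted_normal and is_normal:
            tp += 1
        elif predicted_normal and not is_normal:
            fp += 1
        elif is_normal:
            fn += 1
    return f_value(tp, fp, fn)
```

For example, a rule with TP = 2, FP = 1 and FN = 1 has precision and recall both equal to 2/3, and formula (3) then also evaluates to 2/3.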
The training time is short because an incremental GA is deployed, whose population is not large all the time, i.e. it grows after each iteration. The system was demonstrated on an AMD Athlon 64 X2 Dual Core Processor 3800+ with 1 GB of RAM at its disposal.

4 Results

4.1 Training and Testing Datasets

The dataset contains 5000000 network connection records. A connection is a sequence of TCP packets starting and ending at defined times, between which data flows from a source IP to a target IP under a certain protocol [6]. The training portion of the dataset ("kdd_10_percent") contains 494021 connections, of which 20% are normal (97277) and the rest (396743) are attacks. Each connection record contains 41 independent fields and a label (normal or type of attack). Attacks belong to one of four attack categories: user to root, remote to local, probe, and denial of service. The testing dataset ("corrected") has a significantly different statistical distribution than the training dataset (250436 attacks and 60593 normal connections) and contains an additional 14 attacks not included in the training dataset.

The most important flaws of this dataset are given in [7]. The dataset contains biases that may be reflected in the performance of the evaluated systems, for example the skewed nature of the attack distribution. None of the sources explaining the dataset contains any discussion of data rates, and their variation with time is not specified. There is no discussion of whether the quantity of data presented is sufficient to train a statistical anomaly system or a learning-based system. Furthermore, it is demonstrated in [8] that the transformation model used for transforming DARPA's raw network data into a well-featured data item set is 'poor'. Here 'poor' refers to the fact that some attribute values are the same in different data items that have different class labels.
Due to this, some of the attacks cannot be classified correctly.

4.2 Obtained Rates

The system was trained using "kdd_10_percent" and tested on the "corrected" dataset. The obtained results are summarized in Table 3. The last column gives the value of the classical F-measure, so that the results can be compared using a single figure that combines recall and precision. The false-positive rate is reduced from 40.7% to 1.4%, while the detection rate decreases by only 0.15 percentage points. The F-measure also increases.

Table 3. The performance of the whole system and its parts separately

                     Detection rate        False-positive rate
System               Num.      Per. (%)    Num.      Per. (%)     F-measure
Linear classifier    231030    92.25       24628     40.7         0.913
Rule-based           45504     75.1        5537      2.2          0.815
Whole system         230625    92.1        862       1.4          0.96

The adaptability of the system was tested as well. At first, the system was trained with a subset of "kdd_10_percent" (250000 connections out of 494021). The generated rules were taken as the initial generation and re-trained with the remaining data of the "kdd_10_percent" dataset. Both of the systems were tested on the "corrected" dataset. The system exhibited improvements in both detection and false-positive rate. The improvements are presented in Table 4.

Table 4. The performance of the system after re-training

                                                Detection rate        False-positive rate
System                                          Num.      Per. (%)    Num.      Per. (%)     F-measure
Trained with a subset                           183060    73.1        1468      2.4          0.84
Re-trained with the rest of the training data   231039    92.3        862       1.4          0.96

The drawbacks of the dataset have influenced the obtained rates. As a comparison, the detection rate of the system tested on the same data that it was trained on, i.e. "kdd_10_percent", is 99.2%, compared with the detection rate of 92.1% obtained when testing the system on the "corrected" dataset. Thus, the dataset deficiencies stated previously in this section had negative effects on the rates obtained in this work.
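The serial decision logic evaluated in this section can be summarized in a short sketch. It is an illustration under assumptions rather than the authors' implementation: `flags_attack` is a hypothetical stand-in for the evolved linear classifier, `normal_rules` stands for the evolved rule set, and the two rates are assumed to be computed as detected attacks over all attacks and falsely flagged normal connections over all normal connections.

```python
from typing import Callable, Dict, Sequence, Tuple

Connection = Dict[str, object]
Rule = Tuple[str, int, int]  # (service, hot, logged_in)


def matches(rule: Rule, con: Connection) -> bool:
    """True when the connection has exactly the rule's feature values."""
    return (con["service"], con["hot"], con["logged_in"]) == rule


def serial_classify(con: Connection,
                    flags_attack: Callable[[Connection], bool],
                    normal_rules: Sequence[Rule]) -> str:
    """First stage decides; an 'attack' verdict is re-checked by the rules."""
    if not flags_attack(con):
        return "normal"            # first-stage 'normal' is accepted as is
    if any(matches(rule, con) for rule in normal_rules):
        return "normal"            # a rule recognizes a normal connection
    return "attack"


def rates(connections: Sequence[Connection],
          truths: Sequence[str],
          flags_attack: Callable[[Connection], bool],
          normal_rules: Sequence[Rule]) -> Tuple[float, float]:
    """Return (detection rate, false-positive rate) as fractions."""
    detected = false_alarms = attacks = normals = 0
    for con, truth in zip(connections, truths):
        predicted = serial_classify(con, flags_attack, normal_rules)
        if truth == "attack":
            attacks += 1
            detected += predicted == "attack"
        else:
            normals += 1
            false_alarms += predicted == "attack"
    detection_rate = detected / attacks if attacks else 0.0
    false_positive_rate = false_alarms / normals if normals else 0.0
    return detection_rate, false_positive_rate
```

Because the second stage can only turn an "attack" verdict into "normal", it can lower the false-positive rate of the first stage but never raise its detection rate, which matches the direction of the changes reported in Table 3.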
5 Conclusions

In this work a novel approach consisting of a serial combination of two GA-based IDSs is introduced. The properties of the resulting system, including its adaptability, were analyzed. The resulting system exhibits very good characteristics in terms of detection rate, false-positive rate and F-measure.

The system has been implemented in a way that corresponds well to the deployed dataset, mostly in terms of the chosen features. In a real system this does not have to be the case. Thus, an implementation of the system for a real-world environment has to be adjusted, in the sense that the set of chosen features has to be changed as the environment changes.

Due to the inherent high parallelism of the presented system, it can be implemented using reconfigurable hardware. This will result in a high-performance real-world implementation with considerably lower implementation cost, size and power consumption compared to the existing solutions. Part of the future work will consist of pursuing a hardware implementation of the presented system.

Acknowledgements. This work has been partially funded by the Spanish Ministry of Education and Science under the project TEC2006-13067-C03-03 and by the European Commission under the FastMatch project FP6 IST 27095.

References

1. Banković, Z., Stepanović, D., Bojanić, S., Nieto-Taladriz, O.: Improving Network Security Using Genetic Algorithm Approach. Computers & Electrical Engineering 33(5-6), 438–451
2. Grosan, C., Abraham, A., Chis, M.: Computational Intelligence for light weight intrusion detection systems. In: International Conference on Applied Computing (IADIS 2006), San Sebastian, Spain, pp. 538–542 (2006); ISBN: 9728924097
3. Gong, R.H., Zulkernine, M., Abolmaesumi, P.: A Software Implementation of a Genetic Algorithm Based Approach to Network Intrusion Detection. In: Proceedings of SNPD/SAWN 2005 (2005)
4. Chittur, A.: Model Generation for an Intrusion Detection System Using Genetic Algorithms (accessed in 2006), http://www1.cs.columbia.edu/ids/publications/gaids-thesis01.pdf
5. Weiss, G.: Mining with rarity: A unifying framework. SIGKDD Explorations 6(1), 7–19 (2004)
6. http://kdd.ics.uci.edu/ (October 1999)
7. McHugh, J.: Testing Intrusion Detection Systems: A Critique of the 1998 and 1999 DARPA IDS Evaluation as Performed by Lincoln Laboratory. ACM Trans. on Information and System Security 3(4), 262–294 (2000)
8. Bouzida, Y., Cuppens, F.: Detecting known and novel network intrusion. In: IFIP/SEC 2006 21st International Information Security Conference, Karlstad, Sweden (2006)
9. Goldberg, D.E.: Genetic algorithms for search, optimization, and machine learning. Addison-Wesley, Reading (1989)
10. GAlib: A C++ Library of Genetic Algorithm Components, http://lancet.mit.edu/ga/
11. Pan, Z., Chen, S., Hu, G., Zhang, D.: Hybrid Neural Network and C4.5 for Misuse Detection. In: Proceedings of the Second International Conference on Machine Learning and Cybernetics, November 2003, vol. 4, pp. 2463–2467 (2003)
12. Folino, G., Pizzuti, C., Spezzano, G.: GP Ensemble for Distributed Intrusion Detection Systems. In: Singh, S., Singh, M., Apte, C., Perner, P. (eds.) ICAPR 2005. LNCS, vol. 3686. Springer, Heidelberg (2005)
13. Laskov, P., Düssel, P., Schäfer, C., Rieck, K.: Learning Intrusion Detection: Supervised or Unsupervised? In: Roli, F., Vitulano, S. (eds.) ICIAP 2005. LNCS, vol. 3617, pp. 50–57. Springer, Heidelberg (2005)
14. Yao, J.T., Zhao, S.L., Saxton, L.V.: A Study on Fuzzy Intrusion Detection. Data mining, intrusion detection, information assurance and data networks security (2005)
15. Chawla, N.V., Lazarevic, A., Hall, L.O., Bowyer, K.: SMOTEBoost: Improving prediction of the minority class in boosting. In: Proceedings of Principles of Knowledge Discovery in Databases (2003)
16. Fodor, I.K.: A Survey of Dimension Reduction Techniques, http://llnl.gov/CASC/sapphire/pubs