Evaluating Sequential Combination of Two Genetic
Algorithm-Based Solutions for Intrusion Detection
Zorana Banković, Slobodan Bojanić, and Octavio Nieto-Taladriz
ETSI Telecomunicación, Universidad Politécnica de Madrid, Ciudad Universitaria s/n,
28040 Madrid, Spain
{zorana,slobodan,nieto}@die.upm.es
Abstract. The paper presents a serial combination of two genetic algorithm-based intrusion
detection systems. Feature extraction techniques are deployed in order to reduce the amount of
data that the system needs to process. The designed system is simple enough not to introduce
significant computational overhead, but at the same time is accurate, adaptive and fast. A
large number of existing solutions are based on machine learning techniques, but most of them
introduce high computational overhead. Moreover, due to its inherent parallelism, our solution
can be implemented using reconfigurable hardware at an implementation
cost much lower than that of traditional systems. The model is verified on the KDD99
benchmark dataset, generating a solution competitive with the state of the art.
Keywords: intrusion detection, genetic algorithm, sequential combination, principal component analysis, multi expression programming.
1 Introduction
Computer networks are usually protected against attacks by a number of access restriction policies (anti-virus software, firewall, message encryption, secured network
protocols, password protection). Since these solutions have proven insufficient for
providing a high level of network security, there is a need for additional support in
detecting malicious traffic. Intrusion detection systems (IDS) are placed inside the
protected network, looking for potential threats in network traffic and/or audit data
recorded by hosts.
IDS have three common problems that should be tackled when designing such a system: speed, accuracy and adaptability. The speed problem arises from the extensive amount of data that needs to be monitored in order to perceive the entire situation. An existing approach to solving this problem is to split the network stream into more
manageable streams and analyze each in real time using a separate IDS. The event
stream must be split in a way that covers all relevant attack scenarios, but this assumes that all attack scenarios are known a priori.
We are deploying a different approach. Instead of defining different attack scenarios, we extract the features of network traffic that are likely to take part in an attack.
This provides higher flexibility since a feature can be relevant for more than one attack or is prone to be abused by an unknown attack. Moreover, we need only one IDS
E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 147–154, 2009.
springerlink.com
© Springer-Verlag Berlin Heidelberg 2009
to perform the detection. Finally, in this way the total amount of data to be processed
is greatly reduced. Hence, the time spent on offline training of the system
and afterwards the time spent on attack detection are also reduced.
Incorporation of learning algorithms provides a potential solution for the adaptation and accuracy issues. A great number of machine learning techniques have been
deployed for intrusion detection in both commercial systems and state-of-the-art research.
These techniques can introduce a certain amount of ‘intelligence’ into the process of
detecting attacks, and are capable of processing large amounts of data much faster than
a human. However, these systems can introduce high computational overhead. Furthermore, many of them do not deal properly with so-called ‘rare’ classes [5], i.e. the
classes that have a significantly smaller number of elements than the rest of the classes.
This problem occurs mostly due to the tendency towards generalization that most of these
techniques exhibit. Intrusions can be considered rare classes since the amount of intrusive traffic is considerably smaller than the amount of normal traffic. Thus, we
need a machine learning technique that is capable of dealing with this issue.
In this work we present a genetic algorithm (GA) [9] approach for classifying network connections. GAs are robust, inherently parallel, adaptable and suitable
for dealing with the classification of rare classes [5]. Moreover, due to their inherent
parallelism, they make it possible to implement the system using reconfigurable devices
without the need to deploy a microprocessor. In this way, the implementation cost
would be much lower than the cost of implementing a traditional IDS, while offering a
higher level of adaptability.
This work represents a continuation of our previous one [1], where we investigated
the possibilities of applying GA to intrusion detection while deploying a small subset of
features. The experiments confirmed the robustness of GA and inspired us to
continue experimenting on the subject.
Here we further investigate a combination of two GA-based intrusion detection
systems. The first system in the sequence is an anomaly-based IDS implemented as a simple linear classifier. This system exhibits a high false-positive rate. Thus, we have
added a simple system based on if-then rules that filters the decisions of the linear
classifier and in that way significantly reduces the false-positive rate. We effectively create a
strong classifier built upon weak classifiers, but without the need to follow the process of a boosting algorithm [15], as both of the created systems can be trained
separately.
For evolving our GA-based system, the KDD99Cup training and testing dataset was
used [6]. The KDD99Cup dataset was found to have a number of drawbacks [7], [8], but it is still
the prevailing dataset used for training and testing of IDS due to its good structure and
availability [3], [4]. Because of these shortcomings, the presented results do not illustrate the behavior of the system in a real-world environment, but they do reflect its
capabilities. In the general case, the performance of the implemented system highly depends on the training data.
In the following text, Section 2 gives a survey and comparison of machine learning techniques deployed for intrusion detection. Section 3 details the implementation
of the system. Section 4 presents the benchmark KDD99 training and testing dataset,
evaluates the performance of the system using this dataset and discusses the results.
Finally, the conclusions are drawn in Section 5.
2 Machine Learning Techniques for Intrusion Detection – Survey
and Comparison
In the recent past there has been a growing recognition of intelligent techniques for
the construction of efficient and reliable intrusion detection systems. A complete
survey of these techniques is hard to present at this point, since there are more than
a hundred IDS based on machine learning techniques. Some of the best-performing
techniques in the state of the art apply GA [4], a combination of neural networks
and C4.5 [11], a genetic programming (GP) ensemble [12], support vector machines
[13] or fuzzy logic [14].
All of the named techniques have two steps: training and testing. The systems have
to be constantly retrained using new data since new attacks are emerging every day.
The advantage of all GA or GP-based techniques lies in their easy retraining. It is
enough to use the best population evolved in the previous iteration as the initial population and repeat the process, but this time including new data. Thus, our system is
inherently adaptive, which is an essential quality of an IDS.
Furthermore, GAs are intrinsically parallel: since they have multiple offspring,
they can explore the solution space in multiple directions at once. Due to the parallelism that allows them to implicitly evaluate many schemas at once, GAs are particularly well-suited to solving problems where the space of all potential solutions is too
vast to search exhaustively in any reasonable amount of time, as is the case with network data.
GA-based techniques are appropriate for dealing with rare classes. As they work
with populations of candidate solutions rather than a single solution and employ stochastic operators to guide the search process, GAs cope well with attribute interactions and avoid getting stuck in local maxima, which together make them very
suitable for classifying rare classes [5]. We have gone further by deploying the standard F-measure as the fitness function. The F-measure, a combination of precision and recall, has
proven very suitable when dealing with rare classes [5]. Rare
cases and classes are valued when using these metrics because both precision and
recall are defined with respect to a rare class. None of the GA or GP techniques stated
above considers the problem of rare classes.
A technique that considers the problem of rare classes is given in [15]. Their solution is similar to ours in the sense that they deploy a boosting technique, which also
assumes creating a strong classifier by combining several weak classifiers. Furthermore, they present the results in the terms of F-measure. The advantage of our system
is that we can train the parts of our system independently, while boosting algorithm
trains its parts one after another.
Finally, if we want to consider the possibility of hardware implementation using
reconfigurable hardware, not all the systems are appropriate due to their sequential
nature or high computational complexity. Due to the parallelism of our algorithm a
hardware implementation using reconfigurable devices is possible. This can lead to
lower implementation cost with higher level of adaptability compared to the existing
solutions and reduced amount of time for system training and testing.
In short, the main advantage of our solution lies in the fact that it includes important characteristics (high accuracy and performance, dealing with rare classes, inherent adaptability, feasibility of hardware implementation) in one solution. We are not
familiar with any existing solution that would cover all the characteristics mentioned
above.
3 System Implementation
The implemented IDS is a serial combination of two IDSs. The complete system is
presented in Fig. 1. The first part is a linear classifier that classifies connections into
normal ones and potential attacks. Due to its very low false-negative rate, the decision
that it makes on normal connections is considered correct. But, as it exhibits high
false-positive rate, if it opts for an attack, it has to be re-checked. This re-checking is
performed by a rule-based system whose rules are trained to recognize normal connections. This part of the system exhibits very low false-positive rate, i.e. the probability for an attack to be incorrectly classified as a normal connection is very low. In
this way, the achieved false-positive rate of the entire system is significantly reduced
while maintaining a high detection rate. As our system is trained and tested on the KDD99
dataset, the selection of the most important features is performed once at the beginning
of the process. Implementation for a real-world environment, however, would require
performing the feature selection process before each training run.
Fig. 1. Block Diagram of the Complete System
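The serial combination described above can be sketched in a few lines. This is only an illustration of the control flow, not the authors' C++ implementation: the function name `cascade_classify` and the two callables passed to it are hypothetical stand-ins for the evolved linear classifier and the evolved rule set.

```python
from typing import Callable, Mapping

Connection = Mapping[str, object]


def cascade_classify(
    conn: Connection,
    linear_flags_attack: Callable[[Connection], bool],
    matches_normal_rule: Callable[[Connection], bool],
) -> str:
    """Two-stage decision: linear classifier first, rule-based re-check second."""
    # Stage 1: a "normal" verdict from the linear classifier is accepted as final,
    # because its false-negative rate is reported to be very low.
    if not linear_flags_attack(conn):
        return "normal"
    # Stage 2: a flagged connection is re-checked by the rules that
    # recognize normal traffic; only unmatched connections stay flagged.
    if matches_normal_rule(conn):
        return "normal"
    return "attack"
```

Only connections flagged by the first stage reach the second one, so the rule-based system can only lower the number of alarms raised by the linear classifier.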
The linear classifier is based on a linear combination of three features. The features
are identified as those that have the highest possibility to take part in an attack by
deploying PCA [1]. The details of PCA algorithm are explained in [16]. The selected
features and their explanations are presented in Table 1.
Table 1. The features used to describe the attacks

Name of the feature        Explanation
duration                   length (number of seconds) of the connection
src_bytes                  number of data bytes from source to destination
dst_host_srv_serror_rate   percentage of connections that have “SYN” errors
The linear classifier is evolved using GA algorithm [9]. Each chromosome, i.e. potential solution to the problem, in the population consists of four genes, where the first
three represent coefficients of the linear classifier and the fourth one the threshold value.
The decision about the current connection is made according to the formula (1):
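\[
\mathrm{gene}(1)\cdot\mathrm{con}(\mathrm{duration}) + \mathrm{gene}(2)\cdot\mathrm{con}(\mathrm{src\_bytes}) + \mathrm{gene}(3)\cdot\mathrm{con}(\mathrm{dst\_host\_srv\_serror\_rate}) < \mathrm{gene}(4) \qquad (1)
\]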
where con(duration), con(src_bytes) and con(dst_host_srv_serror_rate) are the values of the duration, src_bytes and dst_host_srv_serror_rate feature of the current
connection.
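A minimal sketch of how formula (1) could be evaluated for one chromosome is given below. The text does not state which side of the inequality corresponds to which class, so the function only reports whether the inequality holds. The gene values and connection records in the test are made-up illustrations, not evolved results.

```python
from typing import Mapping, Sequence


def linear_score(genes: Sequence[float], conn: Mapping[str, float]) -> float:
    """Left-hand side of formula (1): weighted sum of the three features."""
    return (
        genes[0] * conn["duration"]
        + genes[1] * conn["src_bytes"]
        + genes[2] * conn["dst_host_srv_serror_rate"]
    )


def satisfies_formula_1(genes: Sequence[float], conn: Mapping[str, float]) -> bool:
    """True when the weighted sum is below the threshold gene (the fourth gene)."""
    return linear_score(genes, conn) < genes[3]
```

Here `genes[0]` to `genes[2]` play the role of gene(1) to gene(3), and `genes[3]` is the threshold gene(4).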
The linear classifier is trained using an incremental GA [9]. The population contains
1000 individuals, which were evolved over 300 generations. The mutation rate was
0.1, while the crossover rate was 0.9. These values were chosen after a number
of experiments: the population size and the number of generations were increased
until further increases no longer brought a significant performance improvement. The type of crossover deployed was uniform crossover, i.e. a new individual had an equal chance of
inheriting each gene from either of its parents. The performance measurement, i.e.
the fitness function, was the squared percentage of correctly classified connections, according to the formula:
\[
\mathit{fitness} = \left(\frac{\mathit{count}}{\mathit{numOfCon}}\right)^{2} \qquad (2)
\]
where count is the number of correctly classified connections, while numOfCon is the
number of connections in the training dataset. The squared percentage was chosen
rather than the simple percentage value because the achieved results were better. The
result of this GA was its best individual which forms the first part of the system presented in Fig.1.
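The fitness function (2) and the uniform crossover described above can be sketched as follows. This is an illustrative sketch only; the original system relies on GAlib for these operators, and the function names here are hypothetical.

```python
import random
from typing import List, Sequence


def squared_accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Formula (2): (count / numOfCon) squared."""
    count = sum(1 for p, y in zip(predictions, labels) if p == y)
    return (count / len(labels)) ** 2


def uniform_crossover(
    parent_a: Sequence[float], parent_b: Sequence[float], rng: random.Random
) -> List[float]:
    """Each gene of the child comes from either parent with equal probability."""
    return [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]
```

Squaring does not change the ranking of individuals, since it is monotonic on [0, 1], but it widens the gap between good and mediocre individuals. Whether this explains the better results reported by the authors is not stated in the text.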
The second part of the system (Fig. 1) is a rule-based system, where simple if-then
rules for recognizing normal connections are evolved. The most important features
were taken over from the results obtained in [2] using Multi Expression Programming
(MEP). The features and their explanations are listed in Table 2.
Table 2. The features used to describe normal connections

Name of the feature   Explanation
service               destination service (e.g. telnet, ftp)
hot                   number of hot indicators
logged_in             1 if successfully logged in; 0 otherwise
An example of a rule can be the following one:
if (service = "http" and hot = "0" and logged_in = "0")
then normal;
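A rule of this form can be represented as a triple of feature values and checked against a connection as sketched below. Two assumptions are made for illustration: the feature values are compared by simple equality, and a connection is accepted as normal if at least one rule of the evolved set matches it. The text does not spell out how the rule set is combined.

```python
from typing import Iterable, Mapping, Tuple

# A rule holds one value per feature: (service, hot, logged_in).
Rule = Tuple[str, int, int]


def rule_matches(rule: Rule, conn: Mapping[str, object]) -> bool:
    """True when all three feature values of the connection equal the rule's genes."""
    service, hot, logged_in = rule
    return (
        conn["service"] == service
        and conn["hot"] == hot
        and conn["logged_in"] == logged_in
    )


def is_normal(rules: Iterable[Rule], conn: Mapping[str, object]) -> bool:
    """Assumed combination: any matching rule marks the connection as normal."""
    return any(rule_matches(rule, conn) for rule in rules)
```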
The rules are trained using an incremental GA with the same parameters as those used for the
linear classifier. Each 3-gene chromosome represents a rule, where the value of each
gene is the value of its corresponding feature. The population in this case, however,
contained 500 individuals, as no improvements were achieved with larger populations. The result of the training was a set of the 200 best-performing rules. The fitness
function in this case was the F-value with the parameter 0.8:
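\[
\mathit{fitness} = \frac{1.8 \cdot \mathit{recall} \cdot \mathit{precision}}{0.8 \cdot \mathit{precision} + \mathit{recall}}, \quad \mathit{precision} = \frac{TP}{TP + FP}, \quad \mathit{recall} = \frac{TP}{TP + FN} \qquad (3)
\]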
where TP, FP and FN stand for the number of true positives, false positives and false
negatives respectively.
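Formula (3) has the shape of the general F-measure, (1 + β²)·precision·recall / (β²·precision + recall), with the value 0.8 in the place of β². A minimal sketch of this fitness computation follows; the function name and the zero-division guard are additions made for illustration.

```python
def f_value_fitness(tp: int, fp: int, fn: int, beta_sq: float = 0.8) -> float:
    """Formula (3): F-value with 0.8 in the place of beta squared."""
    if tp == 0:
        # Precision and recall are both zero (or undefined); treat fitness as zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1.0 + beta_sq) * recall * precision / (beta_sq * precision + recall)
```

With a value below 1 in the place of β², the measure gives precision somewhat more weight than recall.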
The system presented here was implemented in the C++ programming language. The
software for this work used the GAlib genetic algorithm package, written by Matthew
Wall at the Massachusetts Institute of Technology [10]. Training the implemented system takes 185 seconds, while the testing process takes 45 seconds. The
short training time results from deploying an incremental GA, whose population is
not large throughout the run, i.e. it grows after each iteration. The system was demonstrated on an AMD Athlon 64 X2 Dual Core Processor 3800+ with 1 GB of RAM
at its disposal.
4 Results
4.1 Training and Testing Datasets
The KDD99 dataset contains 5000000 network connection records. A connection is a sequence of TCP packets starting and ending at defined times, between which data
flows from a source IP to a target IP under a certain protocol [6]. The training portion
of the dataset (“kdd_10_percent”) contains 494021 connections, of which 20% are
normal (97277) and the rest (396743) are attacks. Each connection record contains 41
independent fields and a label (normal or type of attack). Attacks belong to one of
four attack categories: user to root, remote to local, probe, and denial of service.
The testing dataset (“corrected”) has a significantly different statistical distribution from the training dataset (250436 attacks and 60593 normal connections) and contains an additional 14 attacks not included in the training dataset.
The most important flaws of this dataset are described in [7].
The dataset contains biases that may be reflected in the performance of the evaluated
systems, for example the skewed nature of the attack distribution. None of the sources
explaining the dataset contains any discussion of data rates, and its variation with time
is not specified. There is no discussion of whether the quantity of data presented is
sufficient to train a statistical anomaly system or a learning-based system.
Furthermore, it is demonstrated in [8] that the transformation model used for transforming DARPA’s raw network data into a well-featured data item set is ‘poor’. Here
‘poor’ refers to the fact that some attribute values are the same in different data items
that have different class labels. Due to this, some of the attacks cannot be classified
correctly.
4.2 Obtained Rates
The system was trained using “kdd_10_percent” and tested on the “corrected” dataset.
The obtained results are summarized in Table 3. The last column gives the value of the
classical F-measure, so that the results can be easily compared using a single
figure that combines recall and precision. The false-positive rate is reduced from 40.7% to
1.4%, while the detection rate decreases by only 0.15 percentage points. The F-measure
increases as well.
Table 3. The performance of the whole system and its parts separately

System              Detection rate        False positive rate   F-measure
                    Num.      Per. (%)    Num.      Per. (%)
Linear classifier   231030    92.25       24628     40.7        0.913
Rule-based          45504     75.1        5537      2.2         0.815
Whole system        230625    92.1        862       1.4         0.96
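The percentages and F-measure values in Table 3 can be reproduced from the raw counts and the class sizes of the “corrected” dataset given in Section 4.1 (250436 attacks, 60593 normal connections). The sketch below assumes that the “classical F-measure” is the balanced F1 score and that, for the rule-based row, the positive class is the normal connection, since those rules are trained to recognize normal traffic.

```python
from typing import Tuple

ATTACKS = 250436  # attack connections in the "corrected" test set
NORMAL = 60593    # normal connections in the "corrected" test set


def rates(tp: int, fp: int, positives: int, negatives: int) -> Tuple[float, float, float]:
    """Return (detection rate, false-positive rate, F1) from raw counts."""
    fn = positives - tp
    detection = tp / positives
    false_positive = fp / negatives
    f1 = 2 * tp / (2 * tp + fp + fn)
    return detection, false_positive, f1


linear = rates(231030, 24628, ATTACKS, NORMAL)
rule_based = rates(45504, 5537, NORMAL, ATTACKS)  # positives are normal connections
whole = rates(230625, 862, ATTACKS, NORMAL)
```

For the whole system this gives a detection rate of about 92.1%, a false-positive rate of about 1.4% and an F1 of about 0.96, in line with the last row of the table.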
The adaptability of the system was tested as well. At first, the system was trained
with a subset of “kdd_10_percent” (250000 connections out of 494021). The generated rules were taken as the initial generation and re-trained with the remaining data
of the “kdd_10_percent” dataset. Both systems were tested on the “corrected” dataset.
The system exhibited improvements in both the detection and the false-positive rate. The
improvements are presented in Table 4.
Table 4. The performance of the system after re-training

System                                   Detection rate        False positive rate   F-measure
                                         Num.      Per. (%)    Num.      Per. (%)
Trained with a subset                    183060    73.1        1468      2.4         0.84
Re-trained with the rest of the          231039    92.3        862       1.4         0.96
training data
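The re-training procedure, in which the population evolved on the first subset seeds the next run, can be sketched as below. Note that this is a simplified generational GA with truncation selection written for illustration; it is not the incremental GA of GAlib used by the authors. Only the mutation rate (0.1), the crossover rate (0.9) and the use of uniform crossover are taken from the text; the Gaussian mutation step and the function name `evolve` are assumptions.

```python
import random
from typing import Callable, List, Sequence

Individual = List[float]


def evolve(
    population: Sequence[Sequence[float]],
    fitness: Callable[[Individual], float],
    generations: int,
    rng: random.Random,
    mutation_rate: float = 0.1,
    crossover_rate: float = 0.9,
) -> List[Individual]:
    """Toy GA; returns the final population sorted best-first."""
    pop = [list(ind) for ind in population]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: len(pop) // 2]  # the better half survives unchanged
        children: List[Individual] = []
        while len(parents) + len(children) < len(pop):
            a, b = rng.sample(parents, 2)
            if rng.random() < crossover_rate:
                # Uniform crossover: each gene from either parent.
                child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
            else:
                child = list(a)
            if rng.random() < mutation_rate:
                i = rng.randrange(len(child))
                child[i] += rng.gauss(0.0, 0.1)  # assumed mutation step
            children.append(child)
        pop = parents + children
    pop.sort(key=fitness, reverse=True)
    return pop
```

Re-training then amounts to calling `evolve` a second time, passing the population returned by the first call together with a fitness function computed on the enlarged training data.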
The drawbacks of the dataset have influenced the obtained rates. For comparison,
the detection rate of the system tested on the same data that it was trained on, i.e.
“kdd_10_percent”, is 99.2%, compared with a detection rate of 92.1% when the
system is tested on the “corrected” dataset. Thus, the dataset deficiencies stated previously in
this section had negative effects on the rates obtained in this work.
5 Conclusions
In this work a novel approach consisting of a serial combination of two GA-based
IDSs is introduced. The properties of the resulting system, including its adaptability,
were analyzed. The resulting system exhibits very good characteristics in terms of
the detection rate, the false-positive rate and the F-measure.
The implementation of the system has been performed in a way that corresponds
well to the deployed dataset, mostly in terms of the chosen features. In a real system this does not have to be the case. Thus, an implementation of the system for a
real-world environment has to be adjusted in the sense that the set of chosen features has to be updated as the environment changes.
Due to the inherent high parallelism of the presented system, there is a possibility
of its implementation using reconfigurable hardware. This will result in a high-performance real-world implementation with considerably lower implementation cost,
size and power consumption compared to the existing solutions. Part of the future
work will consist in pursuing hardware implementation of the presented system.
Acknowledgements. This work has been partially funded by the Spanish Ministry of
Education and Science under the project TEC2006-13067-C03-03 and by the European Commission under the FastMatch project FP6 IST 27095.
References
1. Banković, Z., Stepanović, D., Bojanić, S., Nieto-Taladriz, O.: Improving Network Security
Using Genetic Algorithm Approach. Computers & Electrical Engineering 33(5-6), 438–451
2. Grosan, C., Abraham, A., Chis, M.: Computational Intelligence for light weight intrusion
detection systems. In: International Conference on Applied Computing (IADIS 2006), San
Sebastian, Spain, pp. 538–542 (2006); ISBN: 9728924097
3. Gong, R.H., Zulkernine, M., Abolmaesumi, P.: A Software Implementation of a Genetic
Algorithm Based Approach to Network Intrusion Detection. In: Proceedings of
SNPD/SAWN 2005 (2005)
4. Chittur, A.: Model Generation for an Intrusion Detection System Using Genetic Algorithms (accessed in 2006), http://www1.cs.columbia.edu/ids/
publications/gaids-thesis01.pdf
5. Weiss, G.: Mining with rarity: A unifying framework. SIGKDD Explorations 6(1), 7–19
(2004)
6. http://kdd.ics.uci.edu/ (October 1999)
7. McHugh, J.: Testing Intrusion Detection Systems: A Critique of the 1998 and 1999
DARPA IDS Evaluation as Performed by Lincoln Laboratory. ACM Trans. on Information and System security 3(4), 262–294 (2000)
8. Bouzida, Y., Cuppens, F.: Detecting known and novel network intrusion. In: IFIP/SEC
2006 21st International Information Security Conference, Karlstad, Sweden (2006)
9. Goldberg, D.E.: Genetic algorithms for search, optimization, and machine learning. Addison-Wesley, Reading (1989)
10. GAlib: A C++ Library of Genetic Algorithm Components,
http://lancet.mit.edu/ga/
11. Pan, Z., Chen, S., Hu, G., Zhang, D.: Hybrid Neural Network and C4.5 for Misuse Detection. In: Proceedings of the Second International Conference on Machine Learning and
Cybernetics, November 2003, vol. 4, pp. 2463–2467 (2003)
12. Folino, G., Pizzuti, C., Spezzano, G.: GP Ensemble for Distributed Intrusion Detection
Systems. In: Singh, S., Singh, M., Apte, C., Perner, P. (eds.) ICAPR 2005. LNCS,
vol. 3686. Springer, Heidelberg (2005)
13. Laskov, P., Düssel, P., Schäfer, C., Rieck, K.: Learning Intrusion Detection: Supervised or
Unsupervised? In: Roli, F., Vitulano, S. (eds.) ICIAP 2005. LNCS, vol. 3617, pp. 50–57.
Springer, Heidelberg (2005)
14. Yao, J.T., Zhao, S.L., Saxton, L.V.: A Study on Fuzzy Intrusion Detection. Data mining,
intrusion detection, information assurance and data networks security (2005)
15. Chawla, N.V., Lazarevic, A., Hall, L.O., Bowyer, K.: SMOTEBoost: Improving prediction
of the minority class in boosting. In: Proceedings of Principles of Knowledge Discovery in
Databases (2003)
16. Fodor, I.K.: A Survey of Dimension Reduction Techniques,
http://llnl.gov/CASC/sapphire/pubs