The review of literature presented in this chapter is centered upon new
studies of classifier-based text mining approaches for data mining
applications and on ensemble methods.
Early work on ensembles (Hansen & Salamon, 1990) suggested that
ensembles with as few as ten members were adequate to reduce test-set
error sufficiently.
Michie, Spiegelhalter, and Taylor (1994) tried to find the relationship
between the best-performing method and the data types of the input/output
variables. However, the common understanding among data mining practitioners
and researchers is that no universal best-performing method exists. That is,
different kinds of methods have their own advantages and weaknesses: a
method can perform best for one specific problem, yet on another problem a
different method can work better. This situation is called selective
superiority (Michie et al., 1994).
Many researchers have investigated the technique of combining the
predictions of multiple classifiers to produce a single classifier (Breiman,
1996c; Clemen, 1989; Perrone, 1993; Wolpert, 1992). The resulting classifier
(hereafter referred to as an ensemble) is generally more accurate than any of
the individual classifiers making up the ensemble. Both theoretical (Hansen &
Salamon, 1990; Krogh & Vedelsby, 1995) and empirical (Hashem, 1997; Opitz
& Shavlik, 1996a, 1996b) research has demonstrated that a good ensemble is
one where the individual classifiers in the ensemble are both accurate and
make their errors on different parts of the input space. Two popular methods
for creating accurate ensembles are bagging (Breiman, 1996c) and boosting
(Freund & Schapire, 1996; Schapire, 1990). These methods rely on
“resampling” techniques to obtain different training sets for each of the
classifiers. This work presents a comprehensive evaluation of bagging on data
mining problems using four base classification methods: k-Nearest Neighbor
(k-NN), Radial Basis Function (RBF), Multilayer Perceptron (MLP), and Support
Vector Machine (SVM).
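As an illustration of the resampling idea (a minimal sketch only, not the experimental setup of this work), bagging a k-NN base classifier can be expressed in a few lines of Python; the use of scikit-learn, the synthetic dataset, and all parameter values are assumptions for demonstration.

```python
# Minimal bagging sketch: each base k-NN model is trained on a bootstrap
# resample of the training set and predictions are combined by majority vote.
# Dataset, library, and parameters are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ensemble = BaggingClassifier(KNeighborsClassifier(n_neighbors=5),
                             n_estimators=10, random_state=0)
ensemble.fit(X_tr, y_tr)
print("bagged k-NN test accuracy:", ensemble.score(X_te, y_te))
```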
Combining the output of several classifiers is useful only if there is
disagreement among them. Obviously, combining several identical classifiers
produces no gain. Hansen and Salamon (1990) proved that if the average error
rate for an example is less than 50% and the component classifiers in the
ensemble are independent in the production of their errors, the expected error
for that example can be reduced to zero as the number of classifiers combined
goes to infinity; however, such assumptions rarely hold in practice. Krogh and
Vedelsby (1995) later proved that the ensemble error can be divided into a term
measuring the generalization error of each individual classifier and a term
measuring the disagreement among the classifiers. What they formally showed
was that an ideal ensemble consists of highly correct classifiers that disagree
as much as possible. Opitz and Shavlik (1996a, 1996b) empirically verified that
such ensembles generalize well.
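Krogh and Vedelsby's result for weighted regression ensembles can be stated compactly as follows (the notation below is a standard rendering of their decomposition, not a quotation from the original paper):

```latex
% Ambiguity decomposition (Krogh & Vedelsby, 1995): the ensemble error E is
% the weighted average member error \bar{E} minus the weighted average
% ambiguity \bar{A}. Since \bar{A} >= 0, disagreement directly reduces E.
\[
  E = \bar{E} - \bar{A}, \qquad
  \bar{E} = \sum_i w_i E_i, \qquad
  \bar{A} = \sum_i w_i\, \overline{\bigl(f_i(x) - \bar{f}(x)\bigr)^2},
\]
% where the members f_i have weights w_i >= 0 with \sum_i w_i = 1 and the
% ensemble prediction is \bar{f}(x) = \sum_i w_i f_i(x).
```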
Breiman (1996c) showed that bagging is effective on “unstable” learning
algorithms where small changes in the training set result in large changes in
predictions. Breiman (1996c) claimed that neural networks and decision trees
are examples of unstable learning algorithms.
If the learning algorithm has tunable learning parameters, one can set each
base model to different values; for example, in the area of neural networks
one can give each base model different initial random weights or a different
topology (Sharkey, 1996).
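A minimal sketch of this idea in Python follows; the topologies, seeds, and the majority-vote combination are illustrative assumptions, not Sharkey's procedure.

```python
# Diversity by varying learning parameters: each neural network member gets
# different initial random weights (seed) and a different topology.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

members = [MLPClassifier(hidden_layer_sizes=h, random_state=seed, max_iter=500)
           for h, seed in [((10,), 1), ((20,), 2), ((10, 10), 3)]]
for m in members:
    m.fit(X, y)

# Combine the members by majority vote (binary labels 0/1 assumed).
votes = np.array([m.predict(X) for m in members])
majority = np.round(votes.mean(axis=0)).astype(int)
```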
The boosting literature (Schapire, Freund, Bartlett, & Lee, 1997) has
recently suggested (based on a few data sets with decision trees) that it is
possible to further reduce the test-set error even after ten members have been
added to an ensemble (and they note that this result also applies to bagging).
The idea of combining fitted values from a number of fitting attempts has
been suggested by several authors (LeBlanc & Tibshirani, 1996; Mojirsheibani,
1997, 1999; Merz, 1999). In an important sense, the whole becomes more than
the sum of its parts. Perhaps the earliest procedure to exploit a
combination of “random trees” is bagging (Breiman, 1996).
In supervised learning (Mitchell, 1997), a pool of labeled data, S, is used
to predict the labels of unseen data. Using S, an empirical accuracy can be
calculated and this can be used as an estimate of the generalization accuracy.
Two accepted techniques for estimating the generalization accuracy are
subsampling and k-fold cross validation. Both techniques may employ a
stratified partitioning in which the subsets contain approximately the same
proportion of classes as S.
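As a brief illustration (implementation details are assumptions; k = 10 is a common choice, not one mandated by the text), stratified k-fold cross-validation can be set up as follows:

```python
# Estimating generalization accuracy with stratified k-fold cross-validation:
# each fold preserves approximately the class proportions of the pool S.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=cv)
print("estimated accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```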
Systems can be monitored at various levels. Various factors, including
cost, accuracy, and the ability to differentiate normal from abnormal
behavior, influence the choice. Typically, intrusion detection systems
monitor either behavior or privileged processes. Although the former method
was more popular earlier (Denning, 1987), recent studies have used the
latter (Lee, Stolfo, & Mok, 1998; Hofmeyr et al., 1998). Hofmeyr et al.
(1998) found that short sequences of system calls are a good discriminator
for several types of intrusion. Ensembles improve prediction performance
through the combined use of two effects: reduction of errors due to bias
and reduction of errors due to variance (Haykin, 1999).
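Hofmeyr et al.'s short-sequence discriminator can be illustrated with a toy sketch; the traces and the window length k = 3 below are invented examples, not data from their study.

```python
# Represent program behavior as short, fixed-length windows of system calls;
# windows never seen during normal operation are flagged as anomalous.
def short_sequences(trace, k=3):
    """Return the set of length-k sliding windows over a system-call trace."""
    return {tuple(trace[i:i + k]) for i in range(len(trace) - k + 1)}

normal_trace = ["open", "read", "mmap", "read", "close"]
observed_trace = ["open", "read", "exec", "read", "close"]

profile = short_sequences(normal_trace)
anomalies = short_sequences(observed_trace) - profile
print(anomalies)  # windows absent from the normal profile
```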
The purpose of ensemble learning is to build a learning model that
integrates a number of base learning models, so that the combined model
gives better generalization performance on a particular dataset than any of
the individual base models (Dietterich, 2000).
Ensemble methods often perform extremely well and, in many cases, can
be shown to have desirable statistical properties (Breiman, 2001a, 2001c).
Fang et al. (2003) proposed two methods to track the variations in
the signature patterns written by the same person. The variations can occur in
the shape or in the relative positions of the characteristic features. Given the
set of training signature samples, the first method measures the positional
variations of the one-dimensional projection profiles of the signature patterns;
and the second method determines the variations in relative stroke positions in
the two-dimensional signature patterns.
Kagan Tumer and Nikunj C. Oza (2003) showed that input-decimated
ensembles outperform ensembles whose base classifiers use all the input
features, randomly selected subsets of features, or features created using
principal components analysis, on a wide range of domains.
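The flavor of input decimation can be conveyed with a short sketch; note that for brevity the feature subsets below are drawn at random, whereas Tumer and Oza select features by their correlation with each class, which is not reproduced here.

```python
# Input-decimated ensemble sketch: each base classifier is trained on (and
# queried with) its own subset of the input features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

members = []
for _ in range(5):
    cols = rng.choice(X.shape[1], size=8, replace=False)  # feature subset
    members.append((cols, KNeighborsClassifier().fit(X[:, cols], y)))

# Majority vote, each member applied to its own feature subset.
votes = np.array([clf.predict(X[:, cols]) for cols, clf in members])
pred = np.round(votes.mean(axis=0)).astype(int)
```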
Niall Rooney et al. (2004) investigated an algorithmic extension to the
technique of Stacked Regression that prunes the size of a homogeneous
ensemble set based on a consideration of the accuracy and diversity of the set
members. They showed that the pruned ensemble set is as accurate on average
over the data-sets tested as the non-pruned version, which provides benefits in
terms of its application efficiency and reduced complexity of the ensemble.
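A much-simplified sketch of pruning a homogeneous regression ensemble by validation accuracy follows; Rooney et al.'s actual criterion also rewards diversity among the retained members, which this toy version omits.

```python
# Keep only the base regressors with the lowest validation squared error.
import numpy as np

def prune_ensemble(members, X_val, y_val, keep=5):
    """Return the `keep` members with the lowest validation MSE."""
    errors = [np.mean((m.predict(X_val) - y_val) ** 2) for m in members]
    return [members[i] for i in np.argsort(errors)[:keep]]
```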
Ensemble methods could perhaps do that job better (Berk et al., 2004).
That is, the selection process could be better captured and the probability of
membership in each treatment group estimated with less bias. More credible
estimates of intervention effects would follow.
P. M. Granitto, P. F. Verdes, and H. A. Ceccatto (2005) presented an
extensive evaluation of several algorithms for ensemble construction,
including new proposals, comparing them with standard methods in the
literature. Their algorithms and the weighted modifications tested favorably
against other methods in the literature, producing a sensible improvement in
performance on most of the standard statistical databases used as
benchmarks.
Zonghua Zhang and Hong Shen (2005) modified the conventional SVM,
Robust SVM, and one-class SVM based on ideas from Online SVM, and compared
the performance of the modified algorithms with that of the originals. After
elaborate theoretical analysis, concrete experiments were carried out with
the 1998 DARPA BSM data set collected at MIT's Lincoln Labs. These
experiments verify that the modified SVMs can be trained online and
outperform the original ones, with fewer support vectors (SVs) and less
training time and without decreasing detection accuracy. Both of these
achievements could significantly benefit an effective online intrusion
detection system.
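The notion of online training can be illustrated as follows; scikit-learn has no Online SVM of the kind Zhang and Shen modify, so the sketch uses SGDClassifier with hinge loss, a standard online approximation of a linear SVM, purely to show the streaming-update pattern.

```python
# Online-style training: the model is updated batch by batch, as an
# intrusion detector would see audit data arrive over time.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
clf = SGDClassifier(loss="hinge")  # hinge loss ~ linear SVM objective

for start in range(0, len(X), 200):
    batch = slice(start, start + 200)
    clf.partial_fit(X[batch], y[batch], classes=np.unique(y))
```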
Gavin Brown, Jeremy L. Wyatt, and Peter Tiňo (2005) presented the results of
an empirical study showing significant improvements over simple ensemble
learning, and found that their technique is competitive with a variety of
methods, including boosting, bagging, mixtures of experts, and Gaussian
processes, on a number of tasks.
Oliver Buchtala, Manuel Klimek, and Bernhard Sick (2005) described an
evolutionary algorithm (EA) that performs feature and model selection
simultaneously for radial basis function (RBF) classifiers. In order to reduce the
optimization effort, various techniques are integrated that accelerate and
improve the EA significantly: hybrid training of RBF networks, lazy evaluation,
consideration of soft constraints by means of penalty terms, and
temperature-based adaptive control of the EA. The feasibility and the
benefits of the approach are demonstrated by means of four data mining
problems: intrusion detection in computer networks, biometric signature
verification, customer acquisition with direct marketing methods, and
optimization of chemical
production processes. It is shown that, compared to earlier EA-based RBF
optimization techniques, the runtime is reduced by up to 99% while error rates
are lowered by up to 86%, depending on the application. The algorithm is
independent of specific applications so that many ideas and solutions can be
transferred to other classifier paradigms.
Songbo Tan (2006) proposed a new refinement strategy, called Drag
Pushing, for the KNN classifier. Experiments on three benchmark evaluation
collections show that Drag Pushing achieved a significant improvement in the
performance of the KNN classifier.
Alok Sharma, Arun K. Pujari, and Kuldip K. Paliwal (2007) focused on
intrusion detection based on system call sequences using text processing
techniques. They introduced a kernel-based similarity measure for the
detection of host-based intrusions. The k-nearest neighbour (kNN) classifier
is used to classify a process as either normal or abnormal. The proposed
technique is evaluated on the DARPA-1998 database and its performance is
compared with other existing techniques available in the literature. It is
shown that this technique is significantly better than the other techniques
in achieving lower false positive rates at a 100% detection rate.
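The text-processing view can be sketched briefly; the traces and labels below are toy data, and tf-idf with cosine similarity stands in for the kernel-based similarity measure of Sharma et al., which is not reproduced here.

```python
# Treat each process's system-call trace as a "document" and classify it
# with kNN over tf-idf vectors using cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

traces = ["open read mmap read close",
          "open read read write close",
          "open exec socket send close"]
labels = ["normal", "normal", "abnormal"]

vec = TfidfVectorizer()
X = vec.fit_transform(traces)
knn = KNeighborsClassifier(n_neighbors=1, metric="cosine").fit(X, labels)
print(knn.predict(vec.transform(["open read mmap write close"])))
```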
Hyoung-joo Lee and Sungzoon Cho (2007) proposed using novelty detection
approaches to alleviate the class imbalance in response modeling. Two novelty
detectors, one-class support vector machine (1-SVM) and learning vector
quantization for novelty detection (LVQ-ND), are compared with binary
classifiers for a catalogue mailing task with DMEF4 dataset. The novelty
detectors are more accurate and more profitable when the response rate is low.
When the response rate is relatively high, however, a support vector machine
model with modified misclassification costs performs best. In addition, the
novelty detectors turn in higher profits with a low mailing cost, while the
SVM model is the most profitable with a high mailing cost.
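The novelty-detection idea can be sketched as follows; the synthetic data and the nu and gamma settings are assumptions for illustration, not the DMEF4 setup of the paper.

```python
# Train a one-class SVM on the abundant (non-respondent) class only; points
# it labels -1 look novel and are treated as likely respondents.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
non_respondents = rng.normal(0.0, 1.0, size=(500, 5))  # majority class
candidates = rng.normal(0.5, 1.5, size=(50, 5))        # unseen customers

oc_svm = OneClassSVM(nu=0.05, gamma="scale").fit(non_respondents)
flagged = oc_svm.predict(candidates) == -1
print("flagged as potential respondents:", int(flagged.sum()))
```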
Sandhya Peddabachigari, Ajith Abraham, Crina Grosan, and Johnson
Thomas (2007) presented two hybrid approaches for modeling IDS. Decision
trees (DT) and support vector machines (SVM) are combined as a hierarchical
hybrid intelligent system model (DT–SVM) and as an ensemble approach
combining the base classifiers. The hybrid intrusion detection model
combines the individual base classifiers and other hybrid machine learning
paradigms to maximize detection accuracy and minimize computational
complexity. Empirical results illustrate that the proposed hybrid systems
provide more accurate intrusion detection.
Taeshik Shon and Jongsub Moon (2007) proposed a new SVM approach,
named Enhanced SVM, which combines one-class and soft-margin SVM methods
in order to provide unsupervised learning and a low false alarm capability
similar to that of a supervised SVM approach.
Nikunj C. Oza and Kagan Tumer (2008) showed that, mathematically,
classifier ensembles provide an extra degree of freedom in the classical
bias/variance tradeoff, allowing solutions that would be difficult (if not
impossible) to reach with only a single classifier. Because of these
advantages,
classifier ensembles have been applied to many difficult real-world problems.
They survey selected applications of ensemble methods to problems that have
historically been most representative of the difficulties in classification. In
particular, they surveyed applications of ensemble methods to remote sensing,
person recognition, one vs. all recognition and medicine.
Response modeling has become a key factor in direct marketing. In
general, there are two stages in response modeling. The first stage is to
identify respondents from a customer database, while the second stage is to
estimate the purchase amounts of the respondents. Dongil Kim, Hyoung-joo
Lee, and Sungzoon Cho (2008) focused on the second stage, where a
regression, not a classification, problem is solved. Recently, several
non-linear models based on machine learning, such as support vector machines
(SVM), have been applied to response modeling.
Rachid Beghdad (2008) presented a critical study of the use of some
neural networks (NNs) to detect and classify intrusions. The aim of the
research is to determine which NN classifies the attacks well and leads to
the highest detection rate for each attack. The study focused on two types
of classification of records: single-class (normal or attack), and
multiclass, where the category of attack is also detected by the NN. Five
different types of NNs were tested: multilayer perceptron (MLP), generalized
feed forward (GFF), radial basis function (RBF), self-organizing feature map
(SOFM), and principal component analysis (PCA) NN. In the single-class case,
the PCA NN achieves the highest detection rate.
Xuchun Li, Lei Wang, and Eric Sung (2008) showed that AdaBoost
incorporating properly designed RBFSVM (SVM with the RBF kernel) component
classifiers, called AdaBoostSVM, can perform as well as SVM. Furthermore,
the proposed AdaBoostSVM demonstrates better generalization performance than
SVM on imbalanced classification problems. The key idea of AdaBoostSVM is
that, for the sequence of trained RBFSVM component classifiers, starting
with large σ values (implying weak learning), the σ values are reduced
progressively as the boosting iteration proceeds. This effectively produces
a set of RBFSVM component classifiers whose model parameters are adaptively
different, manifesting in better generalization compared to an AdaBoost
approach with SVM component classifiers using a fixed (optimal) σ value. On
benchmark data sets, it is shown that their AdaBoostSVM approach outperforms
other AdaBoost approaches using component classifiers such as decision trees
and neural networks.
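A compact sketch of this idea follows; the discrete AdaBoost loop is standard, but the fixed σ (gamma) schedule is an assumption, whereas the paper adapts σ according to each component's training error.

```python
# AdaBoostSVM-style loop: RBF-SVM components start weak (large sigma, i.e.
# small gamma = 1/(2*sigma^2)) and become stronger as boosting proceeds.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
y = 2 * y - 1                              # AdaBoost labels in {-1, +1}
w = np.full(len(X), 1.0 / len(X))          # example weights
components, alphas = [], []

for gamma in [0.01, 0.05, 0.1, 0.5, 1.0]:  # sigma shrinks left to right
    clf = SVC(kernel="rbf", gamma=gamma).fit(X, y, sample_weight=w)
    pred = clf.predict(X)
    err = np.sum(w * (pred != y)) / np.sum(w)
    if err >= 0.5:                         # component too weak; discard it
        continue
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
    w *= np.exp(-alpha * y * pred)         # reweight toward mistakes
    w /= w.sum()
    components.append(clf)
    alphas.append(alpha)

# Final prediction: sign of the weighted vote of the components.
F = sum(a * c.predict(X) for a, c in zip(alphas, components))
ensemble_pred = np.sign(F)
```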
Wing W.Y. Ng, Daniel S. Yeung, Michael Firth, Eric C.C. Tsang, and Xi-Zhao
Wang (2008) proposed a novel hybrid filter–wrapper-type feature subset
selection methodology using a localized generalization error model. In the
experiments, for two of the datasets, the classifiers built using feature
subsets with 90% of features removed by their proposed approach yield
average testing accuracies higher than those of classifiers trained using
the full set of features.
Tian Xinguang, Duan Miyi, Sun Chunlai, and Li Wenfa (2008) presented a
novel method for detecting anomalous program behavior, applicable to
host-based intrusion detection systems that monitor system call activities.
The
method constructs a homogeneous Markov chain model to characterize the
normal behavior of a privileged program, and associates the states of the
Markov chain with the unique system calls in the training data. At the
detection stage, the probabilities that the Markov chain model supports the
system call sequences generated by the program are computed.
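A toy sketch of the Markov chain idea follows; the training trace, smoothing floor, and scoring by a simple product of transition probabilities are illustrative assumptions.

```python
# States are the unique system calls in the training data; a test trace is
# scored by the probability the chain assigns to its transitions.
from collections import defaultdict

train = ["open", "read", "mmap", "read", "close"]

counts = defaultdict(lambda: defaultdict(int))
for a, b in zip(train, train[1:]):
    counts[a][b] += 1

P = {}
for a, row in counts.items():              # normalize counts to probabilities
    total = sum(row.values())
    P[a] = {b: c / total for b, c in row.items()}

def trace_probability(trace, floor=1e-6):
    """Product of transition probabilities; unseen transitions get `floor`."""
    p = 1.0
    for a, b in zip(trace, trace[1:]):
        p *= P.get(a, {}).get(b, floor)
    return p

print(trace_probability(["open", "read", "exec"]))  # low value => anomalous
```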
Yaochu Jin and Bernhard Sendhoff (2008) presented an overview of the
existing research on multiobjective machine learning, focusing on supervised
learning. In addition, a number of case studies are provided to illustrate the
major benefits of the Pareto-based approach to machine learning, e.g., how to
identify interpretable models and models that can generalize on unseen data
from the obtained Pareto-optimal solutions. Three approaches to Pareto-based
multiobjective ensemble generation are compared and discussed in detail.
Finally, potentially interesting topics in multiobjective machine learning are
suggested.
Sebastián Maldonado and Richard Weber (2009) introduced a novel wrapper
algorithm for feature selection using support vector machines with kernel
functions. This method is based on sequential backward selection, using the
number of errors in a validation subset as the measure to decide which feature
to remove in each iteration. They compared their approach with other
algorithms like a filter method or Recursive Feature Elimination SVM to
demonstrate its effectiveness and efficiency.
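The wrapper loop can be sketched briefly; the dataset, the RBF kernel settings, and the stopping size of five features are assumptions for illustration.

```python
# Sequential backward selection with an SVM wrapper: repeatedly drop the
# feature whose removal produces the fewest validation errors.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

features = list(range(X.shape[1]))
while len(features) > 5:                   # arbitrary stopping size
    errors = []
    for f in features:                     # evaluate removing each feature
        cols = [c for c in features if c != f]
        clf = SVC(kernel="rbf").fit(X_tr[:, cols], y_tr)
        errors.append(np.sum(clf.predict(X_val[:, cols]) != y_val))
    del features[int(np.argmin(errors))]

print("selected features:", features)
```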
Ioannis Partalas, Grigorios Tsoumakas, and Ioannis Vlahavas (2009) studied
the problem of pruning an ensemble of classifiers from a reinforcement
learning perspective. They contributed a new pruning approach that uses the
Q-learning algorithm to approximate an optimal policy for choosing
whether to include or exclude each classifier from the ensemble. Extensive
experimental comparisons of the proposed approach against state-of-the-art
pruning and combination methods show very promising results.
Following this literature review, the work presented in the remainder of
the thesis is guided by the above considerations.