
IMPROVING THE PERFORMANCE STABILITY OF INDUCTIVE EXPERT
SYSTEMS UNDER INPUT NOISE
Vijay S. Mookerjee, Michael V. Mannino, and Robert Gilson
Department of Management Science, Box 353200
University of Washington, Seattle, WA 98195-3200
[email protected], [email protected],
[email protected]
This paper appears in Information Systems Research 6, 4 (December 1995), 328-356.
Abstract
Inductive expert systems typically operate with imperfect or noisy input attributes.
We study design differences in inductive expert systems arising from implicit
versus explicit handling of input noise. Most previous approaches use an implicit
approach wherein inductive expert systems are constructed using input data of
quality comparable to problems the system will be called upon to solve. We
develop an explicit algorithm (ID3ecp) that uses a clean (without input errors)
training set and an explicit measure of the input noise level and compare it to a
traditional implicit algorithm, ID3p (the ID3 algorithm with the pessimistic pruning
procedure). The novel feature of the explicit algorithm is that it injects noise in a
controlled rather than random manner in order to reduce the performance variance
due to noise. We show analytically that the implicit algorithm has the same
expected partitioning behavior as the explicit algorithm. In contrast, however, the
partitioning behavior of the explicit algorithm is shown to be more stable (i.e.,
lower variance) than the implicit algorithm. To extend the analysis to the
predictive performance of the algorithms, a set of simulation experiments is
described in which the average performance and coefficient of variation of
performance of both algorithms are studied on real and artificial data sets. The
experimental results confirm the analytical results and demonstrate substantial
differences in stability of performance between the algorithms especially as the
noise level increases.
1. Introduction
Inductive expert systems have become an important decision support tool as evidenced by
considerable attention in the academic literature and business press, and a number of commercial
products to develop such systems. Inductive expert systems are typically developed to support
classification tasks, i.e., systems that attempt to classify an object as one of n categories [Quinlan
1986a]. Examples of classification in business decision making include fault diagnosis in
semiconductor manufacturing [Irani et al. 1993], bank failure prediction [Tam and Kiang 1990],
and industry and occupation code prediction [Creecy et al. 1992]. The primary goal of an
inductive expert system is to perform at the same level as human experts. Such systems can
provide many benefits to an organization [Holsapple and Whinston 1987] such as reducing
decision making time, improving the consistency of decisions, and reducing dependence on scarce
human experts.
An inductive expert system is constructed using a learning algorithm and data set. A
learning algorithm develops classification rules that can be used to determine the class of an object
from its description, i.e., from the object's attributes. The classification rules developed by these
algorithms can be depicted as a decision tree in which the non-leaf nodes of the tree prescribe
inputs that must be observed and the arcs represent states that the input variables can take. Leaf
nodes in the tree indicate how an object is to be classified. Induction algorithms build such a tree
from a set of pre-classified cases referred to as the training set. Another part of the data set
known as the test set is used to study the performance of a decision tree on novel cases. Inductive
expert systems are typically developed to maximize solution accuracy; i.e., maximize the number
of cases in which the output (decision, recommendation) provided by the system is similar to that
provided by human experts. Economic considerations, for example, costs of observing inputs,
benefits from system outputs, and other factors that may contribute to system value are rarely
factored into system design [Mookerjee and Dos Santos 1993].
The subject of this paper is the design and performance evaluation of inductive expert
systems using noisy input attributes. The presence of input noise can have a significant impact on
the performance of an inductive expert system. We only consider noise that affects the input
values used by the system to make classification decisions, not other forms of noise.[1] This
definition includes errors from such causes as incorrectly measuring an input, wrongly reporting
the state of an input, relying on stale values, and using imprecise measurement devices. Input
errors in a training set can cause a learning algorithm to form a rule with an incorrect state for an
input, while input errors in cases to be classified can cause the wrong rule to be used.

[1] It must be noted, however, that researchers in the past have discussed other dimensions of noise, such as
conflicts in the classification, classification errors, and missing input data. In this paper, we use the term
noise in a restricted sense, only with reference to input measurement accuracy. Thus, for our purposes,
noise rises if the input measurement accuracy falls and falls if the input measurement accuracy rises.
The specific issue studied here is how to account for the level of input noise: (i) implicitly
through a training set with a representative level of noise or (ii) explicitly through a noise
parameter and a clean training set. Figure 1 graphically depicts the explicit and implicit
approaches. In common practice, the implicit approach is used because it is cost effective and has
been carefully studied. However, high variance of performance is a key disadvantage of the
implicit approach that has not been widely discussed or studied. Our most important finding here
is that an induction algorithm using an explicit noise parameter can have more stable performance
than a comparable implicit algorithm.
Variation in performance is an outcome variable of interest in a wide variety of systems.
For example, the performance of a manufacturing process is judged in terms of its mean behavior
as well as by the variation in its behavior. Reducing variance is routinely used as a performance
objective in survey research where errors may be introduced by interviewers, respondents,
questionnaires, processing of forms, and so on. In the context of inductive expert systems
however, past research has largely ignored this important design objective. Variation in
performance could often be an extremely important aspect of an inductive expert system. For
example, if a loan-granting inductive expert system were to make very good decisions on one set of
cases but extremely poor ones on another, the bank could fail during the period of poor
performance. Thus, managers are likely to prefer a more stable system to a highly variable one
even when the mean performance of the two systems is the same.
[Figure 1: Explicit and Implicit Noise Handling. The explicit algorithm builds a decision tree from a clean training set and a noise parameter; the implicit algorithm builds a decision tree from a noisy training set.]
In addition to stable performance, explicit noise handling is interesting to study because
there are a number of situations in which an explicit approach is more practical than an implicit
approach. One such situation is when the training set is obtained from experts rather than from
historical data. Here, an implicit approach would require that the input states in the training set be
deliberately corrupted. In addition, if there are multiple ways to measure an input or if the level of
noise in historical data is not representative of current practices, an explicit approach may be
preferable.
Because explicit algorithms have not been carefully studied, the major topics presented are
the design and performance evaluation of explicit algorithms. We design an explicit algorithm
that injects noise according to a specified noise parameter as it partitions a data set. The novel
aspect of the algorithm is that it injects noise in a controlled rather than random manner in order
to reduce the variance due to noise. We show analytically that the expected partitioning behavior
of the implicit and explicit algorithms is the same. However, we demonstrate analytically that the
explicit algorithm has more stable partitions than the implicit algorithm. To extend the analytical
results to the classification accuracy of the algorithms, we conduct a set of simulation experiments
on several real and artificial data sets with a range of values for the number of classes and
skewness in the class distributions. Our experimental results reveal that the expected accuracy
and variance of accuracy of the algorithms are consistent with our analytical results.
The rest of this paper is organized as follows. In Section 2, we review related work on
noise handling approaches used in inductive expert systems. In Section 3, we provide background
on decision tree induction and analytically evaluate the behavior of the implicit and explicit
algorithms: ID3p and ID3ecp. In Section 4, we describe the hypotheses investigated, experimental
designs used, and results obtained from a set of simulation experiments. A summary and
conclusions are provided in Section 5.
2. Related Work
In this section, we summarize a theoretical study of input noise, various pruning
procedures, and models to cope with measurement errors in surveys. Laird [Laird 1988] studied
the Bernoulli Noise Process (BNP) as an extension of the theory of Probably, Approximately
Correct (PAC) learning. The goal of PAC theory is to derive an upper bound on the number of
examples needed to approximately learn a concept within a given error bound with a specified
confidence level. A BNP is characterized by independent parameters for the classification error
rate and the input error rate. Laird's basic result is that input error rate alone is not sufficient to
determine the maximum number of examples needed. There must be an additional parameter that
indicates the sensitivity of the true concept definition to input errors. This theoretical result about
the relationship between the importance of an attribute and the impact of noise has been
empirically demonstrated in other studies.
In more applied studies, researchers have developed pruning procedures (post
construction techniques) to refine the rule set generated by a learning algorithm. Learning
algorithms generally find a perfect set of rules for a training set, but the rules are usually too
specialized leading to poor performance on unseen cases. Pruning techniques reduce
specialization by eliminating rules in whole or part. Similarly, pruning techniques have also been
found useful to handle noise because noise in a training set can lead to extra rules and highly
specialized rules. For example, Quinlan [Quinlan 1987] demonstrated that four pruning methods
significantly reduced the complexity of induced decision trees without adversely affecting
accuracy. He later found, in a study of the chi-square pruning technique [Quinlan 1986b], that when
the cases to be classified are noisy, a learning algorithm performs better with a comparably noisy
training set than with a perfect one. He also found that the increase in predictive performance from
noise reduction depends on the importance of an input.
Two important themes in the development of pruning procedures are the use of an extra
test set and parameters to control the amount of pruning. A number of techniques use an extra
test set to choose among multiple collections of rules (critical value pruning [Mingers 1989] and
error complexity pruning [Breiman et al. 1984]) or to reduce the complexity of the rules
(reduced error pruning [Quinlan 1987]). These techniques require more data than techniques
that prune using the training set alone (pessimistic pruning [Quinlan 1987], minimum error
pruning [Niblett and Bratko 1986], and Laplace pruning [Christie 1993]). Mingers [Mingers
1989] found that pruning techniques using an extra test set achieved higher accuracy than
techniques not using an extra test set. However, Quinlan [Quinlan 1987], in an earlier study, did
not find higher accuracy as a result of extra test cases. Several techniques have been developed
that use a single parameter to control the amount of pruning (error complexity pruning [Breiman
et al. 1984] and m-probability-estimate pruning [Cestnik and Bratko 1991]). The parameters are
rather coarse, applying to the entire data set rather than to individual inputs. In addition, there are no
guidelines for setting parameter values except that high values should be used when there is a
large amount of noise. Moulet [Moulet 1991] developed input noise parameters for the
ABACUS discovery system and demonstrated their effectiveness in learning simple laws of
physics. However, he did not apply his technique to classification problems.
Unlike the work on pruning techniques, the area of measurement errors in surveys
[Groves 1991] includes a rich stream of research on techniques to measure and reduce the level of
input noise and models to compensate for the effect of input noise on prediction tasks. This area
of research has developed a detailed classification of input noise beginning with systematic errors
that introduce bias and random errors that cause variance. Beyond this division, the source of
errors (interviewer, respondent, process, questionnaire) and the cause of errors (e.g., memory loss
and non-response) are often identified. When the level of input noise is not known, it can be
estimated using a reinterview technique [Hill 1991, Rao and Thomas 1991] in which error-prone
measurements are made on a sample and then more expensive and relatively error-free
measurements are made on a subsample. The reinterview technique is similar to the idea of
using a clean training set with an explicit estimate of the noise level. Many models have been
developed to compensate for the effect of input noise on regression [Fuller 1991], analysis of
variance [Biemer and Stokes 1991], and estimation of survey statistics of categorical data [Biemer
and Stokes 1991]. However, there is no reported research on classification tasks.
3. Algorithm Design and Analysis
In this section, we present the explicit noise algorithm used in our simulation experiments
and analyze the mean and variance of its partitioning behavior. Before presenting the explicit
noise algorithm, we review the induction process underlying the baseline algorithm ID3p (the
ID3 algorithm [Quinlan 1986a] with the pessimistic pruning procedure).
3.1. Decision Tree Induction
Induction algorithms develop a decision tree by recursively creating non-leaf nodes and
leaf nodes in the tree. Non-leaf nodes are labeled by input names. The input chosen to label a non-leaf node is determined using an input selection criterion and a set of cases. Traditionally, inputs
are selected by their information content, measured by the reduction in information entropy
achieved as a result of observing the input. After an input has been chosen to label a non-leaf
node, each of the q outgoing arcs are labeled by a possible state of the selected input where q is
the number of possible states. The set of cases used to compute the label of a node (for the root
node this is the entire training set) is then partitioned into q subsets such that the state of the input
used to label the node is the same within each subset. The tree can grow along each outgoing arc
using the subset of cases corresponding to the state of the input used to label the arc. Creation of
non-leaf nodes continues along each path of the tree until a stopping condition is reached, at
which stage a leaf node is created. Leaf nodes are labeled using a classification function.
An important factor that must be considered in the design of an induction algorithm is the
input selection criterion. The input selection criterion determines how, from a set of candidate
inputs, an input is chosen to label a non-leaf node. We describe the input selection criterion for
the ID3 algorithm in the remainder of this subsection.
Let
$D_N$ be a randomly drawn training set of size N,
$X_1, X_2, \ldots, X_p$ be the p observable input variables that may be used to classify an object,
$x_{k1}, x_{k2}, \ldots, x_{kq}$ be the q possible states for input $X_k$,
$\tilde{\psi}$ be the random variable for the class of an object,
$c_1, c_2, \ldots, c_m$ be the m possible states for the class variable $\tilde{\psi}$,
$\tilde{Z}$ be a partition of the training set ($\tilde{Z} \subseteq D_N$),
$\pi$ be an input state conjunction, e.g., $X_1 = x_{14} \wedge X_2 = x_{23}$,
$L(\tilde{Z}, \pi)$ be the partition of $\tilde{Z}$ such that $\pi$ is true,
$k(\tilde{Z})$ be a classification function that determines how a leaf node is labeled, and
$g(X_k \mid \tilde{Z})$ be the expected gain for input $X_k$ given $\tilde{Z}$.
The input selection criterion in the ID3 algorithm chooses the input with the maximum
information content (gain), measured by the reduction in information entropy [Shannon and
Weaver 1949] as a result of observing the input. Formally, the gain of input $X_k$ is defined as:

$g(X_k \mid \tilde{Z}) = \Delta EN(X_k \mid \tilde{Z}) = EN(\tilde{Z}) - EN(X_k \mid \tilde{Z})$   (1)

where

$EN(\tilde{Z}) = -\sum_{r=1}^{m} P[\tilde{\psi} = c_r \mid \tilde{Z}] \log_2 P[\tilde{\psi} = c_r \mid \tilde{Z}]$ is the initial expected entropy,   (2)

$EN(X_k \mid \tilde{Z}) = \sum_{j=1}^{q} P[\tilde{X}_k = x_{kj} \mid \tilde{Z}]\, EN(L(\tilde{Z}, \tilde{X}_k = x_{kj}))$ is the expected entropy after observing $X_k$, and   (3)

$EN(L(\tilde{Z}, \tilde{X}_k = x_{kj})) = -\sum_{r=1}^{m} P[\tilde{\psi} = c_r \mid L(\tilde{Z}, \tilde{X}_k = x_{kj})] \log_2 P[\tilde{\psi} = c_r \mid L(\tilde{Z}, \tilde{X}_k = x_{kj})]$   (4)

is the expected entropy in the partition $L(\tilde{Z}, \tilde{X}_k = x_{kj})$.
ID3 estimates the probabilities in equations (2) through (4) by sample proportions obtained from
the training data.
To demonstrate the above equations, consider a sample of 100 cases ($\tilde{Z}$) with 2 classes ($c_1$ = 66,
$c_2$ = 34). Let input $X_k$ with states $x_{k1}, x_{k2}, x_{k3}$ be used to partition the 100 cases. From
equation (2) the initial entropy is:

$EN(\tilde{Z}) = -(0.66 \log_2 0.66 + 0.34 \log_2 0.34) = 0.9248$

For $X_k = x_{k1}$, let $c_1 = 48$ and $c_2 = 12$ be the subpartition of cases obtained. The entropy within
this subpartition is:

$EN(L(\tilde{Z}, X_k = x_{k1})) = -(0.8 \log_2 0.8 + 0.2 \log_2 0.2) = 0.7219$

For $X_k = x_{k2}$, let $c_1 = 10$ and $c_2 = 20$ be the subpartition, with $EN(L(\tilde{Z}, X_k = x_{k2})) = 0.9149$
computed in the same way. For $X_k = x_{k3}$, let $c_1 = 8$ and $c_2 = 2$ be the subpartition with
$EN(L(\tilde{Z}, X_k = x_{k3})) = 0.7219$. From equation (3), the entropy after observing $X_k$ is:

$EN(X_k \mid \tilde{Z}) = (0.6 \times 0.7219 + 0.3 \times 0.9149 + 0.1 \times 0.7219) = 0.7798$

Thus from equation (1) the gain from observing input $X_k$ is:

$g(X_k \mid \tilde{Z}) = \Delta EN(X_k \mid \tilde{Z}) = 0.9248 - 0.7798 = 0.1449$
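The gain calculation above is easy to reproduce programmatically. The following sketch is our own illustration (the function names and hard-coded counts are not part of the original algorithm); it implements equations (1) through (4) for the worked example:

```python
import math

def entropy(counts):
    """Expected entropy (equations 2 and 4) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain(parent_counts, subpartition_counts):
    """Information gain (equation 1): the parent entropy minus the
    size-weighted entropy of the subpartitions (equation 3)."""
    n = sum(parent_counts)
    after = sum(sum(sub) / n * entropy(sub) for sub in subpartition_counts)
    return entropy(parent_counts) - after

# Worked example: 100 cases with classes (66, 34), input X_k with three states.
parent = [66, 34]
subpartitions = [[48, 12], [10, 20], [8, 2]]      # x_k1, x_k2, x_k3
print(round(entropy(parent), 4))                  # 0.9248
print(round(gain(parent, subpartitions), 4))      # ~0.144 (the text reports 0.1449 using rounded proportions)
```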
To cope with noise and overfitting, we augment ID3 with the pessimistic pruning
procedure [Quinlan 1987]. We call the augmented ID3 algorithm, ID3p. The pessimistic pruning
procedure uses a statistical measure to determine whether replacing a sub-tree by its best leaf (i.e.,
the classification that maximizes accuracy for a given set of cases) is likely to increase accuracy.
If so, the branch is replaced by its best leaf; otherwise the branch is retained. Once all branches
have been examined, the process terminates. A more detailed description can be found in
Appendix B. We use the pessimistic pruning procedure because it is easy to implement, effective
on noisy data [Mingers 1989], and does not require an extra test set.
3.2. Model of Noise
Inductive expert systems typically operate under conditions where inputs are subject to
noise. Noise occurs when the true input state is perturbed by a measurement process. We
assume that measurement error is independent of the time of measurement and the true state of
an input. If we ignore differences among wrong states (e.g., measuring a high value as medium or
low), noise can be characterized as a binomial process with mean C, the probability of correct
measurement. The value of C can be estimated from empirical data as follows. First, for each
noisy input, collect a representative sample of input values. Second, correct the noisy values
through various techniques such as repeated measurement [Hill 1991, Rao and Thomas 1991]
and/or improved measurement, that is, using more expensive and relatively error free devices.
Third, estimate the population error rate from the sample as the number of values that are
incorrect divided by the sample size. One minus the sample error rate is an unbiased estimate of
C.
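As a minimal sketch of this estimation step (the sample values below are hypothetical, not data from the paper):

```python
# Noisy measurements of one input and their corrected counterparts,
# obtained by repeated or improved (relatively error-free) measurement.
noisy_values     = ["high", "low", "medium", "high", "low", "high", "medium", "low"]
corrected_values = ["high", "low", "high",   "high", "low", "high", "medium", "low"]

errors = sum(1 for noisy, true in zip(noisy_values, corrected_values) if noisy != true)
error_rate = errors / len(noisy_values)
C_hat = 1 - error_rate     # estimate of the correct measurement probability C
print(C_hat)               # 0.875 for this hypothetical sample
```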
We define the parameter W as the probability of wrong or incorrect measurement, a
measure of the likelihood of disruptions in the measurement process that lead to a particular
incorrect state being recorded.[2] We assume that wrong states are equally likely and that all inputs
have the same level of noise. Thus, the relationship C + (q-1)W = 1 holds for any input since
there are q-1 incorrect states. For example, if there is a 5-state input and the probability of correct
measurement is C = 0.8, then any wrong state has a probability of W = (1 - 0.8)/(5 - 1) = 0.05.
[2] The noise parameters (C and W) are output rather than process measures. In some studies, the noise
parameter is the probability that the measurement process has been perturbed. A perturbation in the
measurement process may or may not lead to an incorrect measurement. We use an output measure here
because it is more convenient, and a process measure can be mapped to an output measure.
Although the assumptions about constant C across inputs and constant W within an input’s states
can be easily relaxed, we use them to simplify our analysis. Without these assumptions, many
more noise parameters will have to be estimated. More precisely, C and W are defined as:

$C = P[\tilde{X}_{ko} = x_{kj} \mid \tilde{X}_{kt} = x_{kj}] \quad \forall k, j$

$W = P[\tilde{X}_{ko} = x_{kj} \mid \tilde{X}_{kt} = x_{ki}] \quad \forall k, \forall j \neq i$

where $\tilde{X}_{ko}$ and $\tilde{X}_{kt}$ are the random variables for the observed (noisy) and true states of input $X_k$.
To analytically describe the impact of noise on the input gain, we state Proposition 1:

Proposition 1

$\frac{\partial g(\tilde{X}_{ko} \mid \tilde{Z}_C)}{\partial C} > 0, \quad C \in (1/q, 1]$   (6.1)

$\frac{\partial g(\tilde{X}_{ko} \mid \tilde{Z}_C)}{\partial C} = 0, \quad C = 1/q$   (6.2)

$\frac{\partial g(\tilde{X}_{ko} \mid \tilde{Z}_C)}{\partial C} < 0, \quad C \in [0, 1/q)$   (6.3)

where $\tilde{Z}_C$ is a noisy partition with noise level C.
With no noise, the gain is at its highest (C = 1). As noise is increased (value of C is decreased
from 1 to 1/q), the gain decreases to the point where the value of C is 1/q (6.1). When C is equal
to 1/q, there is no benefit from observing the input because it could occur in any of its states with
equal probability (6.2). As the value of C decreases below 1/q, the noise becomes so high that the
observed states become predictably incorrect. Hence, observed states of the input again begin to
provide some information (6.3).
Thus the nature of the relationship between the noise level and the gain is convex,
validating that we are dealing with a reasonable model of noise. In Appendix A, we prove
Proposition 1 for a special case. In addition, our simulations strongly suggest that the gain
monotonically decreases (i.e., less uncertainty reduction) as the correct measurement probability
increases from 0 to 1/q and then monotonically increases until the correct measurement
probability is 1.
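The convex relationship described in Proposition 1 can be illustrated numerically. The sketch below is our own illustration (not the paper's code); it computes the gain of the observed, noisy input as a function of C for a hypothetical population, using the mixture of true-state class distributions implied by the noise model (C for the correct state, W = (1 - C)/(q - 1) for each wrong state):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def noisy_gain(C, gamma, rho):
    """Gain of the observed (noisy) input at correct measurement probability C.
    gamma[i] = P[X_kt = x_ki]; rho[i][r] = P[class c_r | X_kt = x_ki]."""
    q, m = len(gamma), len(rho[0])
    W = (1 - C) / (q - 1)
    prior = [sum(gamma[i] * rho[i][r] for i in range(q)) for r in range(m)]
    after = 0.0
    for j in range(q):                                   # observed state x_kj
        p_obs = C * gamma[j] + W * sum(gamma[i] for i in range(q) if i != j)
        joint = [C * gamma[j] * rho[j][r] +
                 W * sum(gamma[i] * rho[i][r] for i in range(q) if i != j)
                 for r in range(m)]
        after += p_obs * entropy([x / p_obs for x in joint])
    return entropy(prior) - after

gamma = [0.6, 0.3, 0.1]                                  # true state distribution (q = 3)
rho = [[0.8, 0.2], [0.33, 0.67], [0.8, 0.2]]             # class distribution per true state
for C in (0.0, 1 / 3, 0.5, 0.8, 1.0):
    print(round(C, 2), round(noisy_gain(C, gamma, rho), 4))
# The gain is highest at C = 1, falls to 0 at C = 1/q, and rises again as C drops below 1/q.
```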
3.3. Impact of Noise on Expected Behavior
In this subsection, we demonstrate how the information content of a noisy input is
computed in an implicit version of ID3 (ID3p) where a noisy training set is used and in an explicit
version of ID3 (ID3ep) where a clean training set and noise parameters are used.
3.3.1. Implicit Handling of Noise
In an implicit algorithm such as ID3p, the level of noise acts implicitly on the gain of an
input. Specifically, Equation (3) uses two probability estimates: (i) the probability of a class
given a new input and the set of previously observed inputs and (ii) the probability of a new input
given the set of previously observed inputs. With noise, we need to estimate the same two
probabilities except that it is the noisy observed states rather than true states of the input that are
estimated. Equation (7) shows the expected entropy of $\tilde{X}_{ko}$ after observing a set of noisy inputs
($\tilde{\Pi}_o$). ID3p estimates the probabilities in equation (7) by sample proportions obtained from the
noisy training data.

$EN(\tilde{X}_{ko} \mid \tilde{\Pi}_o) = -\sum_{j=1}^{q} P[\tilde{X}_{ko} = x_{kj} \mid \tilde{\Pi}_o] \sum_{r=1}^{m} \zeta_r \log_2 \zeta_r$   (7)

where $\zeta_r = P[\tilde{\psi} = c_r \mid \tilde{X}_{ko} = x_{kj} \wedge \tilde{\Pi}_o]$ and
$\tilde{\Pi}_o = \tilde{\Pi}_{1o} \wedge \ldots \wedge \tilde{\Pi}_{do}$ is the set of previous observations on the path.
3.3.2. Explicit Handling of Noise
Explicit algorithms estimate the probabilities in equation (7) using a clean training set and
noise parameters rather than with a noisy training set. Both implicit and explicit algorithms are
subject to noise processes having the same mean noise parameters (C and W in our case). The
difference is that the noise process has already occurred for implicit algorithms as opposed to the
noise process occurring as part of explicit algorithms. Thus, implicit and explicit algorithms will
compute the same expected decision tree because the expected gain calculations are the same.
Even though explicit algorithms have better information (both the noise level and clean data),
their expected predictive ability is the same as that of implicit algorithms. However, in subsection
3.4, we demonstrate that the additional information can be used to reduce the variance in the
predictive performance of explicit algorithms.
For explicit algorithms, one way to estimate the information content of a noisy input is to
use a recursive Bayesian updating scheme [Pearl 1988]. In the Bayesian approach, we rewrite the
expression for $\zeta_r$ from equation (7) as equation (8) using the chain propagation rule[3] ([Pearl
1988], p. 154). Because we start with a clean training set, the information content of an input in its
true state can be estimated. Note that we use the property of conditional independence wherein
the true state separates the class and the observed state ([Pearl 1988], p. 154).
$\zeta_r = \sum_{i=1}^{q} P[\tilde{X}_{kt} = x_{ki} \mid \tilde{X}_{ko} = x_{kj} \wedge \tilde{\Pi}_o]\, P[\tilde{\psi} = c_r \mid \tilde{X}_{kt} = x_{ki} \wedge \tilde{\Pi}_o]$   (8)
In equation (8), the probability of a class given the current true state and the history of
observed states can be written as:

$P[\tilde{\psi} = c_r \mid \tilde{X}_{kt} = x_{ki} \wedge \tilde{\Pi}_o] = \sum_{\tilde{\Pi}_t \in \pi_t} P[\tilde{\psi} = c_r \mid \tilde{X}_{kt} = x_{ki} \wedge \tilde{\Pi}_t]\, P[\tilde{\Pi}_t \mid \tilde{X}_{kt} = x_{ki} \wedge \tilde{\Pi}_o]$   (9)

where $\tilde{\Pi}_t$ is the random vector for the combination of previously observed variables
(without noise) and $\pi_t$ is the set of true state vectors.
Equation (9) demonstrates the difficulty of a Bayesian approach. The first term on the
right hand side of (9) must be computed $O((p-d)q^d)$ times, where d is the node or input level
(the root node is level 0) in a decision tree, p is the number of inputs, and q is the average number of
input states. The cardinality of $\pi_d$ is $q^d$ because $\pi_d$ contains all possible true states of all inputs on the
path. The joint probability calculations must be repeated for all remaining inputs (p - d). Since the
depth of a tree is partially dependent on the number of inputs, a recursive Bayesian updating
approach is impractical for computational reasons.

[3] The chain propagation rule provides the relation: $P(A \mid B) = \sum_j P(A \mid B, C_j)\, P(C_j \mid B)$.
As an alternative to a Bayesian updating approach, noise effects can be propagated during
tree construction by introducing noise into partitions by randomly scrambling cases using the
parameters C and W. This amounts to computing the class probabilities using a partition instead
of calculating the probabilities in (9). A formal description of the random scrambling procedure
follows.
Procedure Random Scrambling
Input
Z: a partition
X: branching input with q states
C: noise level
Output
S: a 'scrambled' set of partitions of input X, S = {Sk | Sk is a partition of S}, k = 1,2,...,q
Procedure
1. Initialize each Sk ∈ S to ∅.
2. Let T be the set of true partitions resulting from splitting Z into q partitions on input X.
3. For each partition Tk ∈ T do
   3.1. For each case τ ∈ Tk do
      3.1.1. Let r be a random number in [0,1].
      3.1.2. If r ≤ C, then set Sk := Sk ∪ {τ}; else randomly choose Sj, j ≠ k, and set Sj := Sj ∪ {τ}.
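A minimal Python sketch of the random scrambling procedure is given below (our own illustration; a partition is represented simply as a list of cases):

```python
import random

def random_scramble(true_partitions, C, rng=random):
    """Scramble cases across the q partitions of an input: with probability C a case
    stays in its true partition, otherwise it moves to one of the other q-1 partitions
    chosen uniformly at random (probability W each)."""
    q = len(true_partitions)
    scrambled = [[] for _ in range(q)]
    for k, partition in enumerate(true_partitions):
        for case in partition:
            if rng.random() <= C:
                scrambled[k].append(case)
            else:
                j = rng.choice([i for i in range(q) if i != k])
                scrambled[j].append(case)
    return scrambled

# Example: true partitions of sizes 60, 30, and 10 (as in Figure 2), scrambled with C = 0.8.
parts = [list(range(60)), list(range(60, 90)), list(range(90, 100))]
print([len(p) for p in random_scramble(parts, 0.8)])   # e.g., [53, 30, 17]; expected sizes 52, 31, 17
```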
In contrast to the Bayesian updating approach, the complexity of injecting noise through
random scrambling is O((p-d)q) because q scrambling operations for each input are necessary and
only the current input needs to be scrambled. The effect of noise from all but the current input on
the path is already included in the starting noisy partition.
Figure 2 illustrates the random scrambling procedure. The root box depicts a partition of
size 100 with 66 cases expected of class c1 and 34 cases expected of c2. Input Xk is selected and
partitioned by its three states where the expected size and class frequencies of the partitions are
shown in the boxes without parentheses.[4] Noise is then introduced by scrambling all the
partitions of Xk according to the correct noise probability C=0.8. For each case in a partition, a
random number is drawn to determine if the case should be moved to a partition associated with
another state of the input. If the number is less than or equal to C, the case remains in its original
partition. Otherwise the case is randomly moved to a partition associated with another state of
the input. After scrambling, the size and class frequencies of each partition are typically more
uniform because of the impact of noise. In Figure 2, the scrambled partitions are beneath the
clean partitions. The lines indicate that cases can be moved from a clean partition to any noisy
partition.
[Figure 2: Example of the Random Scrambling Process. The root partition of 100 cases (c1: 66, c2: 34) is split on Xk into initial partitions with expected sizes 60 (c1: 48 (24.96), c2: 12 (10.56)) for xk1, 30 (c1: 10 (8.92), c2: 20 (16.06)) for xk2, and 10 (c1: 8 (7.36), c2: 2 (1.96)) for xk3. After scrambling with C = 0.8, the expected sizes are 52 (c1: 40.2 (25.43) {16.14}, c2: 11.8 (10.85) {6.94}), 31 (c1: 13.6 (12.66) {6.03}, c2: 17.4 (14.88) {10.4}), and 17 (c1: 12.2 (11.54) {4.73}, c2: 4.8 (5.05) {1.52}). Lines from each clean partition to every noisy partition indicate that cases can be moved between partitions.]

[4] Numbers inside round and curly brackets are the variances that are explained in subsection 3.4.
The explicit algorithm ID3ep uses the random scrambling procedure to introduce noise into
partitions. First, the usual partitioning process of the ID3 algorithm is performed. Second, the
partitions created in the first step are scrambled using the random scrambling procedure. After all
partitions of an input are scrambled, the normal formulas for the input selection, stopping rule,
and classification function are used. Note that we retain the pessimistic pruning procedure in
ID3ep. Thus, the only difference between ID3p and ID3ep is the way that noise is treated. In ID3p,
the treatment of noise is indirect through a sample of the data collection process. In ID3ep, the
treatment of noise is direct through random injection of noise in a clean training set using the
parameters C and W. Despite the differences in handling noise, the expected behavior of both
algorithms is governed by equation (7). This observation is similar to the results of previous
research comparing input selection criteria [Mantaras 1991].
3.4. Impact of Noise on Variance
As discussed in subsection 3.3, explicit algorithms have no advantage over implicit
algorithms in terms of expected predictive ability. The advantage of explicit noise handling lies in
the potential to make performance more stable. In this section, we study the impact of noise on
the variance in class frequencies between a binomial noise process and a constant noise process.
The binomial noise process represents ID3p and ID3ep in which noise is randomly introduced
either in the training set or through the random scrambling process. The constant noise process
represents a controlled variation of ID3ep (ID3ecp) in which class frequencies in noisy partitions are
set to their estimated expected values. As the analysis elucidates, the variance in size due to noise
is much less in the constant noise process than in the binomial noise process.
We begin with the sampling variance common to both noise processes, limiting our focus
to a single input, state, and class. Let $P[\tilde{X}_{kt} = x_{kj}] = \gamma_j$ and $P[\tilde{\psi} = c_r \mid \tilde{X}_{kt} = x_{kj}] = \rho_{rj}$ be
population parameters, and let $s(L(\tilde{Z}_n, \tilde{X}_{kt} = x_{kj}))$ be the size of the partition of $\tilde{Z}_n$ where $\tilde{X}_{kt} = x_{kj}$.
The expected value and variance of the size of the partition of $\tilde{Z}_n$ (a random sample of size n) where $\tilde{X}_{kt} = x_{kj}$ are
defined in equations (10) and (11), respectively. For each case, the sampling process selects
$\tilde{X}_{kt} = x_{kj}$ with probability $\gamma_j$. Thus, the size of a partition is binomially distributed with mean and
variance as shown in equations (10) and (11).

$E(s(L(\tilde{Z}_n, \tilde{X}_{kt} = x_{kj}))) = n\gamma_j$   (10)

$V(s(L(\tilde{Z}_n, \tilde{X}_{kt} = x_{kj}))) = n\gamma_j(1 - \gamma_j)$   (11)
To analyze the mean and variance of the sub-partition size where $\tilde{\psi} = c_r$, we need to apply
Wald's theorem [Ross 1970, p. 37]. Wald's theorem defines the expected value and variance
of the sum of $\tilde{K}$ random variables where $\tilde{K}$ is also a random variable. The expected value of the
sum of $\tilde{K}$ independent, identically distributed random variables ($\tilde{X}_i$) is given by $E(\tilde{K})E(\tilde{X}_i)$.
The variance of the sum is given by $E(\tilde{K})V(\tilde{X}_i) + V(\tilde{K})(E(\tilde{X}_i))^2$. Here, the expected size of the
beginning partition is $n\gamma_j$ as defined in (10). Within this partition, the probability of $\tilde{\psi} = c_r$ is
$\rho_{rj}$. Applying Wald's theorem, the mean and variance of the sub-partition size are defined by (12)
and (13). In (13), we assume that the class random variables of the individual cases within the
partition are independent and identically distributed.

$E(s(L(\tilde{Z}_n, \tilde{\psi} = c_r \wedge \tilde{X}_{kt} = x_{kj}))) = n\gamma_j\rho_{rj} = \mu_{rj}$   (12)

$V(s(L(\tilde{Z}_n, \tilde{\psi} = c_r \wedge \tilde{X}_{kt} = x_{kj}))) = n\gamma_j\rho_{rj}(1 - \rho_{rj}) + n\gamma_j(1 - \gamma_j)\rho_{rj}^2 = \sigma_{rj}^2$   (13)
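The two terms of equation (13) can be checked by simulating the sampling process. The sketch below is our own check, using the Figure 2 population values for state x_k1 and class c1 (γ = 0.6, ρ = 0.8, n = 100); the simulated variance should be close to σ² = 24.96:

```python
import random

def subpartition_size(n, gamma_j, rho_rj, rng):
    """Number of cases in a sample of size n with true state x_kj and class c_r."""
    size = 0
    for _ in range(n):
        if rng.random() < gamma_j and rng.random() < rho_rj:
            size += 1
    return size

rng = random.Random(0)
n, gamma_j, rho_rj = 100, 0.6, 0.8
sizes = [subpartition_size(n, gamma_j, rho_rj, rng) for _ in range(50000)]
mean = sum(sizes) / len(sizes)
var = sum((s - mean) ** 2 for s in sizes) / len(sizes)
print(round(mean, 2), round(var, 2))   # close to mu = 48 (equation 12) and sigma^2 = 24.96 (equation 13)
```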
Now consider the effect of the binomial noise process common to ID3p and ID3ep. This
process introduces more variance in the class frequency because cases are assigned at random to
partitions based on the parameter C. The expected value and variance of the size of the above
partition after noise are given in equations (14) and (15), where the noise is a binomial process with
mean and variance (C, (1-C)C). Note that these formulations involve another application of
Wald's theorem because once again the size of the beginning partitions is uncertain, as defined in
(12) and (13). In (14) and (15), $n\gamma_j\rho_{rj}$ replaces $E(\tilde{K})$ for correct input states and
$n\gamma_i\rho_{ri}$ replaces $E(\tilde{K})$ for the q-1 incorrect input states. For correct input states, C replaces
$E(\tilde{X}_i)$, and W replaces $E(\tilde{X}_i)$ for the q-1 incorrect states. In (15), $\sigma_{rj}^2$ from (13) replaces $V(\tilde{K})$ for
correct input states and $\sigma_{ri}^2$ replaces $V(\tilde{K})$ for the q-1 incorrect states. For correct input states,
C(1-C) replaces $V(\tilde{X}_i)$, and W(1-W) replaces $V(\tilde{X}_i)$ for the q-1 incorrect states.

$E(s(L(\tilde{Z}_n, \tilde{\psi} = c_r \wedge \tilde{X}_{ko} = x_{kj}))) = n\gamma_j\rho_{rj}C + \sum_{i \neq j}^{q} n\gamma_i\rho_{ri}W = C\mu_{rj} + W\sum_{i \neq j}^{q}\mu_{ri}$   (14)

$V(s(L(\tilde{Z}_n, \tilde{\psi} = c_r \wedge \tilde{X}_{ko} = x_{kj}))) = C(1-C)\mu_{rj} + C^2\sigma_{rj}^2 + W(1-W)\sum_{i \neq j}^{q}\mu_{ri} + W^2\sum_{i \neq j}^{q}\sigma_{ri}^2$   (15)
Now consider the effect of the constant noise process associated with the controlled
scrambling algorithm ID3ecp. Here, the class frequency is set to its expected value. Therefore, the
variance due to noise disappears. However, the variance of the sampling process remains.
Equation (16) for the variance of the partition size with a constant noise process is derived using the variance of a
constant times a function of a random variable. Alternatively, the variance can be derived by
dropping the first and third terms in (15) because the constant noise process has zero variance.
Note that the expected value remains as defined in equation (14).

$V(s(L(\tilde{Z}_n, \tilde{\psi} = c_r \wedge \tilde{X}_{ko} = x_{kj}))) = C^2\sigma_{rj}^2 + W^2\sum_{i \neq j}^{q}\sigma_{ri}^2$   (16)
To further depict the difference between the binomial noise process and the constant noise
process, consider a numerical example based on Figure 2 where the population statistics are given
below.
$P[\tilde{X}_{kt} = x_{k1}] = 0.6$, $P[\tilde{X}_{kt} = x_{k2}] = 0.3$, $P[\tilde{X}_{kt} = x_{k3}] = 0.1$

$P[\tilde{\psi} = c_1 \mid \tilde{X}_{kt} = x_{k1}] = 0.8$, $P[\tilde{\psi} = c_1 \mid \tilde{X}_{kt} = x_{k2}] = 0.33$, $P[\tilde{\psi} = c_1 \mid \tilde{X}_{kt} = x_{k3}] = 0.8$

$P[\tilde{\psi} = c_2 \mid \tilde{X}_{kt} = x_{k1}] = 0.2$, $P[\tilde{\psi} = c_2 \mid \tilde{X}_{kt} = x_{k2}] = 0.67$, $P[\tilde{\psi} = c_2 \mid \tilde{X}_{kt} = x_{k3}] = 0.2$

The labeling of the boxes in Figure 2 shows the expected values for the input state and
class in the clean and noisy partitions. The variances for the binomial (equation (15)) and
constant noise (equation (16)) processes are shown in round and curly brackets, respectively, next
to the expected values (equation (14)) in the noisy partitions. Note the reductions for the
constant noise process.
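The Figure 2 numbers can be reproduced directly from equations (12) through (16). The following sketch is our own calculation (indices follow the notation above); it prints the expected frequency and the two variances for class c1 in the noisy partition x_k1:

```python
n, C, q = 100, 0.8, 3
W = (1 - C) / (q - 1)
gamma = [0.6, 0.3, 0.1]                 # P[X_kt = x_ki]
rho_c1 = [0.8, 0.33, 0.8]               # P[c1 | X_kt = x_ki]

mu = [n * g * r for g, r in zip(gamma, rho_c1)]                          # equation (12)
sigma2 = [n * g * r * (1 - r) + n * g * (1 - g) * r ** 2                 # equation (13)
          for g, r in zip(gamma, rho_c1)]

j = 0                                   # noisy (observed) partition x_k1
others = [i for i in range(q) if i != j]
mean_noisy = C * mu[j] + W * sum(mu[i] for i in others)                  # equation (14)
var_binomial = (C * (1 - C) * mu[j] + C ** 2 * sigma2[j]
                + W * (1 - W) * sum(mu[i] for i in others)
                + W ** 2 * sum(sigma2[i] for i in others))               # equation (15)
var_constant = C ** 2 * sigma2[j] + W ** 2 * sum(sigma2[i] for i in others)  # equation (16)

print(round(mean_noisy, 2), round(var_binomial, 2), round(var_constant, 2))
# approximately 40.2, 25.43, and 16.14, matching the x_k1 box in Figure 2
```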
Figure 3 shows a graphical view of the difference between the constant and binomial noise
processes for a particular class. There are two curves, corresponding to partitions at level 4
in a decision tree. The curve for the binomial noise process is based on (15), while the curve
for the constant noise process is based on (16). For the binomial noise process, Figure 3
demonstrates that the coefficient of variation increases as the noise level increases, with the peak
at about C = 1/q. Although not shown here, the difference between the coefficient of variation of the
two processes increases as the depth of the tree increases because observing more noisy inputs
introduces additional uncertainty.
[Figure 3: Graphical Comparison of Noise Processes of a Node at Level 4. The figure plots the coefficient of variation of partition size (CV Size) against the correct measurement probability C (from 0 to 1) for the binomial noise process and the constant noise process.]
The goal of reducing the variance of the sub-partition sizes is to make the input selection
process more stable and ultimately to reduce the variance in performance on a set of unseen cases.
Analytically, it is rather difficult to measure the variance in the gain because distribution
assumptions are necessary and the mathematics is complicated. Since any distribution
assumptions would rarely be met, it seems better to make a strong statement about the subpartition sizes rather than a weak statement about the gain. Even with a strong statement about
the variance of the gain, simulation experiments are still necessary to link the theoretical behavior
with performance on unseen cases.
3.5. Controlled Scrambling Procedure
The controlled scrambling procedure is designed to behave as close as possible to the
constant noise process and thereby to reduce the variance in the gain and ultimately, the
performance on unseen cases. The differences between the controlled scrambling procedure and
the constant noise processes are due only to rounding of fractional values and conserving the
number of cases in various partitions of the training set. The controlled scrambling procedure first
computes the class and true input state frequencies as close as possible to their estimated values
and then randomly assigns cases to match the computed frequencies. Control of class frequencies
follows from the discussion in Section 3.4.
The reason for controlling the true state frequency is a little subtle, however. The true
state distribution shows the fraction of each true state within a noisy partition. For example in
Figure 2, the true state frequencies in the noisy partition xk1 are 48 (60*0.8) for true state xk1, 3
(30*0.1) for true state xk2, and 1 (10*0.1) for true state xk3. Controlling the true state frequency
does not affect the variance of the class frequency in the current node, but rather it potentially
impacts the class frequencies in descendant nodes. The true state frequency is important to
control when the next input to select in the decision tree is conditionally dependent on the current
input state. When there is a strong dependence, reducing the variance in true state frequencies
will reduce variance in the class frequencies of the next input. For example, the constant noise
curve in Figure 3 was generated with a strong dependence between the current input state and the
next input state. Thus, controlling the true state frequencies is a matter of reducing variance in
descendant nodes rather than in the current node.
Computation of the class and true state frequencies is accomplished by solving two
optimization models. First, the controlled scrambling procedure assigns class frequencies such
that the assignment minimizes the distance between the assigned class frequencies and estimated
class frequencies in a set of partitions (all the partitions of an input) subject to integer, nonnegativity, and case conservation constraints. The latter constraints ensure that the total number
of cases is the same before and after allocation by the controlled scrambling procedure. Second,
the controlled scrambling procedure assigns true state frequencies such that the assignment
minimizes the distance between the assigned true state frequencies and the estimated true state
frequencies in a subset of a partition (all cases of the same class) subject to integer, nonnegativity, and case conservation constraints. The latter constraints are based on the assignment
made by the class frequency optimization model.
Let

$d(c_r, j)$ be the assigned frequency of class $c_r$ in the noisy partition where $\tilde{X}_{ko} = x_{kj}$,

$s(c_r, L(\tilde{Z}, \tilde{X}_{kt} = x_{kj}))$ be the frequency of class $c_r$ in the partition of $\tilde{Z}$ where $\tilde{X}_{kt} = x_{kj}$,

$e(c_r, j)$ be the estimated frequency of class $c_r$ in the noisy partition where $\tilde{X}_{ko} = x_{kj}$:

$e(c_r, j) = C \cdot s(c_r, L(\tilde{Z}, \tilde{X}_{kt} = x_{kj})) + W \sum_{i \neq j}^{q} s(c_r, L(\tilde{Z}, \tilde{X}_{kt} = x_{ki}))$,

$d(x_{ki}, r, j)$ be the assigned frequency of true state $x_{ki}$ in the noisy partition where $\tilde{X}_{ko} = x_{kj}$ and the class is $c_r$,

$e(x_{ki}, r, j)$ be the estimated frequency of true state $x_{ki}$ in the noisy partition where $\tilde{X}_{ko} = x_{kj}$ and the class is $c_r$:

$e(x_{ki}, r, j) = C \cdot s(c_r, L(\tilde{Z}, \tilde{X}_{kt} = x_{ki}))$ if $i = j$,
$e(x_{ki}, r, j) = W \cdot s(c_r, L(\tilde{Z}, \tilde{X}_{kt} = x_{ki}))$ if $i \neq j$.
The class frequency optimization model solves m optimization problems $CF_r$, r = 1,...,m.
In each problem, there are q decision variables, $d(c_r, j)$, j = 1,...,q.

Problem $CF_r$:

$\min \sum_{j=1}^{q} \left( d(c_r, j) - e(c_r, j) \right)^2$   (17)

s.t. $d(c_r, j) \geq 0$ and integer, $\forall j = 1, \ldots, q$

$\sum_{j=1}^{q} d(c_r, j) = \sum_{j=1}^{q} s(c_r, L(\tilde{Z}, \tilde{X}_{kt} = x_{kj}))$   (case conservation constraint)
The true state frequency optimization model solves mq optimization problems $TSF_{rj}$,
r = 1,...,m, j = 1,...,q (one for each combination of class and observed state). In each problem, there
are q decision variables, $d(x_{ki}, r, j)$, i = 1,...,q.

Problem $TSF_{rj}$:

$\min \sum_{i=1}^{q} \left( d(x_{ki}, r, j) - e(x_{ki}, r, j) \right)^2$   (18)

s.t. $d(x_{ki}, r, j) \geq 0$ and integer, $\forall i = 1, \ldots, q$

$\sum_{i=1}^{q} d(x_{ki}, r, j) = d^*(c_r, j)$   (case conservation constraint)

where $d^*(c_r, j)$ is the optimal value of $d(c_r, j)$ in $CF_r$.
The controlled scrambling procedure initially assigns class frequencies (d(cr, j)) by
rounding the estimated class frequencies to the nearest integer value. If the case conservation
constraint is not satisfied for a set of assigned class frequencies, the assigned class frequencies are
adjusted in the order that minimizes the distance from the estimated class frequency until the
constraint is satisfied. A similar procedure is used for the assigned true state frequencies. More
precisely, the algorithm to compute the assigned class frequencies is shown in procedure Assign
Class Frequencies.
Procedure Assign Class Frequencies
Input
Z: a partition
Xk: branching input with q states
C: noise level
Output
D: a set of assigned class frequencies for Z and Xk,
D = {d(cr, j) | d(cr, j) is an assigned class frequency, r = 1,2,...,m; j = 1,2,...,q}
Procedure
1. For each class cr and each state xkj of Xk, set d(cr, j) = rounded(e(cr, j)).
2. For each set of assigned frequencies of a class ({d(cr, j) | j = 1,...,q}), adjust them so that the case conservation constraint is satisfied. Stop when the case conservation constraints are satisfied for all sets of assigned class frequencies.
   2.1. If Σ_j d(cr, j) > Σ_j s(cr, L(Z, Xkt = xkj)), then adjust by taking away from some d(cr, j). Sort the d(cr, j) in descending order of (d(cr, j) - e(cr, j)). Starting from the d(cr, j) with the largest difference, set new d(cr, j) = old d(cr, j) - 1 until the case conservation constraint is satisfied.
   2.2. If Σ_j d(cr, j) < Σ_j s(cr, L(Z, Xkt = xkj)), then adjust by adding to some d(cr, j). Sort the d(cr, j) in ascending order of (d(cr, j) - e(cr, j)). Starting from the d(cr, j) with the smallest difference, set new d(cr, j) = old d(cr, j) + 1 until the case conservation constraint is satisfied.
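A compact Python sketch of this round-then-adjust assignment follows (our own illustration; it handles one class at a time, taking the estimated frequencies across the q noisy partitions and the true number of cases of that class):

```python
def assign_class_frequencies(estimated, true_total):
    """Round the estimated class frequencies, then adjust them by one-unit steps,
    starting with the frequencies that deviate most from their estimates, until
    the case conservation constraint (sum equals true_total) is satisfied."""
    assigned = [round(e) for e in estimated]
    order = sorted(range(len(assigned)), key=lambda j: assigned[j] - estimated[j])
    while sum(assigned) > true_total:            # take away, largest overshoot first
        for j in reversed(order):
            if sum(assigned) == true_total:
                break
            if assigned[j] > 0:
                assigned[j] -= 1
    while sum(assigned) < true_total:            # add, largest undershoot first
        for j in order:
            if sum(assigned) == true_total:
                break
            assigned[j] += 1
    return assigned

# Estimated frequencies of class c1 in the noisy partitions x_k1, x_k2, x_k3 (cf. Figure 2)
print(assign_class_frequencies([40.2, 13.6, 12.2], 66))   # [40, 14, 12]
```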
The procedures to assign the class and true state frequencies are optimal and polynomial in
the number of states and classes. The procedures start with the best assignment (rounded
estimated values) and adjust the best assignment until a feasible value is obtained. The
adjustments are always the smallest deviation from the best assignment. Because there is no
interaction among the decision variables, the resulting assignment minimizes the sum of squared
deviations subject to case conservation constraints. The worst case complexity of the class
frequency assignment algorithm is O(mq log q) for each input because there are m lists of
decision variables, where each list must be sorted (q log q). The operations of computing the
rounded estimates and adjusting the estimates can be performed in time linear in the number of
states. Similarly, the worst case complexity of the true state frequency assignment algorithm is
O(mq^2 log q) because there are mq decision variables. The total worst case complexity is the sum
of the above worst case complexities because the two procedures are independently performed.
The controlled scrambling procedure uses the algorithms to assign class and true state
frequencies. After the class and true state frequency assignments are made, the controlled
scrambling procedure randomly selects sets of cases from the true partitions and assigns them to
scrambled partitions to satisfy the class and true state frequencies. In contrast, the random
scrambling procedure assigns individual cases to scrambled partitions without the constraints of
the class and true state frequencies. Formally, the algorithm to introduce noise in a controlled
manner is presented in procedure Controlled Scrambling.
Procedure Controlled Scrambling
Input
Z: a partition
Xk: branching input with q states
Output
S: a 'scrambled' set of partitions of input Xk, S = {Sj | Sj is a partition of S}, j = 1,2,...,q
Procedure
1. Perform procedure Assign Class Frequencies.
2. Perform procedure Assign True State Frequencies.
3. Let T be the set of true partitions resulting from splitting Z into q partitions on input Xk, T = {Ti | Ti is the partition of T where Xk = xki}.
4. For each j = 1,2,...,q do
   4.1. For each i = 1,2,...,q do
      4.1.1. For each r = 1,2,...,m do
         4.1.1.1. Define Tir such that Tir is the partition of Ti where the class is cr.
         4.1.1.2. Let Sjr be d(xki, r, j) cases randomly selected without replacement from Tir. Set Sj := Sj ∪ Sjr.
The explicit algorithm ID3ecp uses the controlled scrambling procedure to introduce noise
into partitions. First, the usual partitioning process of the ID3 algorithm is performed. Second,
the partitions created in the first step are scrambled using the controlled scrambling procedure.
After all partitions of an input are scrambled, the normal formulas for the input selection, stopping
rule, and classification function are used. Note that we retain the pessimistic pruning procedure in
ID3ecp.
4. Comparison of Algorithm Performance
In this section, we describe simulation experiments that investigate the performance of the
implicit (ID3p) and explicit algorithms (ID3ecp) in terms of average accuracy and coefficient of
variation (CV) of accuracy. We describe the hypotheses, performance measures, data sets,
methodology, and results.
4.1. Hypotheses
The hypotheses extend the analytical results of Section 3 to the performance of the
implicit and explicit algorithms. Section 3 established several relationships between the implicit
and explicit noise handling approaches: (i) same expected behavior on the gain, (ii) lower variance
in the class frequency of a partition for a constant noise process than a binomial noise process,
and (iii) the difference in variance increases as the noise level increases. Because it is difficult to
analytically link the performance of the algorithms to the noise handling approach, simulation
experiments are necessary. We feel that these three results will extend to the performance of the
implicit and explicit algorithms as stated below.
Hypothesis 1: There is no difference in expected performance between the implicit algorithm
ID3p and the explicit algorithm ID3ecp.
Hypothesis 2: The explicit algorithm (ID3ecp) has a smaller coefficient of variation in
performance than the implicit algorithm (ID3p).
Hypothesis 3: The difference in the coefficient of variation of performance between the explicit
algorithm (ID3ecp) and the implicit algorithm (ID3p) increases in magnitude as the noise level
increases over a range of reasonable noise levels.
4.2. Performance Measurement
We use two measures of performance: classification accuracy, a traditional measure, and the
relative information score, a more refined measure. Classification accuracy (the ratio of correctly
classified cases to total cases) does not account for the effects of the number of classes and the
prior probabilities of each class. The Relative Information Score (RIS) [Kononenko and Bratko
1991] measures the percentage of the uncertainty of the data set that is explained by the learning
algorithm. Thus, a high RIS value is preferred to a low value. In equation (19), RIS is computed
as the amount (i.e., the number of bits) of uncertainty removed by the classification process (Ia)
divided by the amount of uncertainty in the data set before classification (the entropy of the
distribution E).
I
RIS = a * 100%
E
(19)
where Ia is the Average Information Score computed as
1 n
Ia = * ∑ I j
T 1
where T is the size of the test set and Ij is the information score of case j where
 − log 2 P(C ) if a correct classification is made
Ij =
 log 2 (1 − P(C )) if an incorrect classification is made
where P(C) is the prior probability of class C (determined from the data set).
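The RIS calculation can be written directly from equation (19). A minimal sketch follows (our own illustration with hypothetical predictions; we take P(C) to be the prior of the case's true class, which is one reading of the definition above):

```python
import math

def relative_information_score(true_classes, predicted_classes, priors):
    """Relative Information Score (equation 19), expressed as a percentage."""
    scores = []
    for true_c, pred_c in zip(true_classes, predicted_classes):
        p = priors[true_c]
        if pred_c == true_c:
            scores.append(-math.log2(p))        # bits of uncertainty removed
        else:
            scores.append(math.log2(1 - p))     # penalty for an incorrect classification
    i_a = sum(scores) / len(scores)             # average information score I_a
    e = -sum(p * math.log2(p) for p in priors.values())   # entropy E of the class distribution
    return 100.0 * i_a / e

priors = {"bankrupt": 0.5, "healthy": 0.5}
true_classes = ["bankrupt", "healthy", "healthy", "bankrupt"]
predicted = ["bankrupt", "healthy", "bankrupt", "bankrupt"]
print(relative_information_score(true_classes, predicted, priors))   # 50.0
```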
4.3. Data Sets
To provide a range of case distributions, we executed the experiments with five data sets,
two real and three artificially generated. The Bankruptcy data set [Liang 1992] contains 50 cases,
8 inputs, and 2 equally distributed classes (bankrupt or healthy). The Lymphography data set
[Murphy and Aha 1991] was developed through data collection at the University Medical Centre,
Institute of Oncology in Ljubljana, Yugoslavia. To reduce the effects of spurious noise, we
removed cases with missing values, removed redundant cases, and removed all but one among
conflicting cases. After this cleansing activity, the Lymphography data set contains 148 cases
distributed among 4 classes where two classes are very sparse (2 and 4 cases, respectively). The
Lymphography data set contains 18 inputs with an average of 3.3 states per input where the
number of states ranges from 2 to 8.
The artificial data sets were generated by a program based on the specifications described
in [Bisson 1991]. The data set generator can control the number of cases, classes, inputs, states
per input, the distribution of cases among classes, and the complexity of the true rule sets for each
class. Data set 1 contains 4 equally distributed classes with 10 inputs. Data set 2 contains 8
moderately skewed classes and 15 inputs. Data set 3 contains 12 highly skewed classes with 20
inputs. In data set 2, two classes have 50% of the cases and the remainder of the cases are
uniformly distributed among the other 6 classes. In data set 3, 80% of the cases are distributed to
3 classes and the remainder are uniformly distributed to the other 9 classes. In the artificial data
sets, the number of cases was 200, the average number of states per input was 3, and the average
size of the true rule sets was 2 rules per class and 3 conjunctive terms per rule.
4.4. Experimental Design
Table 1 shows a 2 × 3 factorial design for the algorithm and noise level factors.[5] The
numbers in the cells show the observations for each treatment. As shown in Table 1, we choose 3
levels of noise: high (C = 0.65), moderate (C = 0.80), and low (C = 0.95). The low level of noise
is close to perfect measurement. The moderate level of noise causes a significant decrease in
predictive performance. The high noise level causes a further significant decrease in predictive
performance. Further decline in predictive performance from higher levels of noise is not as
significant. In separate experiments, the entire range of noise levels is studied to graphically
depict the functional relationship between noise and the mean and variance of performance.

[5] The algorithms and simulation experiment were implemented using Microsoft C on a 486 personal computer.
There are two experiments corresponding to the dependent variable average RIS and
coefficient of variation (CV) of RIS. The method used to estimate performance is the standard
test-sample method [Breiman et al. 1984]. Each observation is the average performance of the
same 30 splits of a data set where the data set is divided roughly 70% for training and 30% for
testing. Each cell of both experiments uses the same set of 100 noise perturbations. In a
perturbation, the data set is randomly changed using the given C and W values. ID3p is given a
perturbed training set while ID3ecp is given a clean training set. Both algorithms use the same test
set and the same perturbed training set in the pessimistic pruning procedure. In the average
experiment, all observations are used. In the CV experiment, each cell contains the same 30[6]
random samples of size 30 from the 100 observations. An observation is the CV computed from
the given sample.

[6] A sample is a random selection of 30 perturbations. Each cell of the CV experiment contained the same 30 perturbations.
Table 1: Experimental Design

Algorithm    Noise Level (C)
             L (.95)              M (.80)              H (.65)
ID3p         100 (AVG), 30 (CV)   100 (AVG), 30 (CV)   100 (AVG), 30 (CV)
ID3ecp       100 (AVG), 30 (CV)   100 (AVG), 30 (CV)   100 (AVG), 30 (CV)
4.5. Results
Analysis of variance was used to determine whether there are performance differences
between the algorithms across the 5 data sets. Tables 2 - 5 report the analysis of variance results
for the Lymphography data set and data set 1 (GD1). The ANOVA tables show that the
simulation results are consistent with the theoretical analysis in Section 3. The noise level (NL)
affects both dependent variables (AVG RIS and CV RIS), but the algorithm (ALG) and the
interaction term (NL*ALG) are not significant for AVG RIS (Tables 2 and 4). However, ALG
and the interaction term are significant for CV RIS (Tables 3 and 5). Thus, the ANOVA results
confirm Hypothesis 1 and support Hypotheses 2 and 3. Similar results were obtained for the
other data sets (see Table 6).
Tables 7 and 8 list mean CV differences between the algorithms (first order differences) at
each noise level and mean differences between the algorithms at adjacent noise levels (second
order differences). Because both the first order and second order differences are significant,
Hypotheses 2 and 3 are confirmed.
Table 2: AVG Lymphography ANOVA Results
SOURCE
SS
df
MS
F
NL
ALG
NL*ALG
WITHIN
TOTAL
73785.8149
39.7113688
31.9080444
17310.1319
91167.5661
2
1
2
594
599
36892.9074
39.7113688
15.9540222
29.1416361
P-value
1265.98614
1.3627021
0.54746488
SOURCE
Table 3: CV Lymphography ANOVA Results
SS
df
MS
F
P-value
NL
ALG
NL*ALG
WITHIN
TOTAL
1.72273569
0.52293785
0.16348104
0.22173633
2.63089091
SOURCE
NL
ALG
NL*ALG
WITHIN
TOTAL
2
1
2
174
179
0.86136785
0.52293785
0.08174052
0.00127435
675.928939 9.1635E-83
410.357584 1.212E-47
64.1430741 1.3533E-21
Table 4: AVG GD1 ANOVA Results
SS
df
MS
F
230805.371
0.74741892
4.5854718
8725.97472
239536.678
2
1
2
594
599
115402.685
0.74741892
2.2927359
14.6901931
6.368E-215
0.24353811
0.57870587
7855.76366
0.05087877
0.15607255
P-value
0
0.82161873
0.85553219
Page 30
SOURCE SS
NL
ALG
NL*ALG
WITHIN
TOTAL
Table 5: CV GD1 ANOVA Results
df
MS
F
0.26582049
0.0891291
0.01144605
0.04442473
0.41082037
2
1
2
174
179
0.13291025
0.0891291
0.00572303
0.00025531
520.574572
349.095311
22.4155911
P-value
3.6729E-74
1.9006E-43
2.1791E-09
Table 6: P-Values for Other Data Sets
           Bankruptcy                 GD2                        GD3
           AVG          CV            AVG         CV             AVG        CV
NL         1.5053E-62   5.7346E-56    0.0000      2.3041E-95     0.0000     4.6106E-76
ALG        0.6006       2.2906E-68    0.8216      9.0516E-64     0.4669     2.5253E-39
NL*ALG     0.9231       1.7721E-33    0.8555      2.3986E-22     0.0225     3.0017E-18
Table 7: Lymphography CV Mean Differences
          ID3p           ID3ecp         t-value         p-value
H         0.429047734    0.251359557    11.53771308     1.16936E-12
M         0.268555164    0.153436962    15.87069539     3.88396E-16
L         0.11611841     0.08552477     6.540714774     1.8278E-07
M-H       -0.160492571   -0.097922595   -3.638112396    0.000529124
L-M       -0.152436757   -0.067912194   -10.34340997    1.52555E-11

Table 8: GD1 CV Mean Differences
          ID3p           ID3ecp         t-value         p-value
H         0.1812388      0.1180832      11.717007       8.071E-13
M         0.1197556      0.0735932      16.073779       2.786E-16
L         0.0678872      0.043692       13.31519161     3.47781E-14
M-H       -0.061483251   -0.044490058   -2.464027269    0.009954073
L-M       -0.051868367   -0.029901399   -6.473220833    2.19337E-07
4.6. Discussion
To depict the magnitude of the performance differences, we generated simulation data
to compare the algorithms graphically. In a simulation run, each observation was computed
from the same 20 splits and 20 perturbations. In Figures 4 and 5, the average performance graphs
of GD1 almost coincide, as expected. In Figures 6 and 7, the average performance graphs for the
Lymphography data set are more erratic, crossing numerous times. This slightly erratic
behavior is probably due to the increased level of residual variation in the Lymphography data set,
as its maximum RIS is slightly less than 50% compared to more than 80% for GD1. The shape of
the average performance graphs provides evidence supporting Proposition 1 in Section 3.2. Note
that average RIS and accuracy are minimized near C = 1/q and that the curves have a bowl-like shape.
For GD1, the average number of states is just below 3. For the Lymphography data set, the
average number of states is 3.3 and there is a larger variation in the number of states (2 of the
inputs have 8 states). Because ID3 favors inputs with many states, the inputs with 8 states
probably appear on most paths in the decision tree. This explains why the Lymphography graphs
reach their minimum below the noise level implied by the simple average number of states in the data set.
As for the CV of performance, Figures 4 through 7 show a strong separation between the
implicit and explicit algorithms. In addition, the CV of performance increases as noise increases
and the difference between the explicit and implicit algorithms increases as the noise level
increases. The shape of the curves and their extreme point is consistent with the theoretical
graphs shown in Figure 3. There is larger separation between the explicit and implicit graphs in
the Lymphography data set (Figures 6 and 7) than for GD1 (Figures 4 and 5). The difference at
low levels of noise is obscured in Figures 6 and 7 because the scale is wider. However, the
difference between the plotted points is larger for the Lymphography data set than GD1 even at
low levels of noise. Note also that the shapes of the CV graphs in Figures 6 and 7 differ because
the CV RIS graph (Figure 6) is affected by some negative RIS values.
To probe the sensitivity of ID3ecp, we generated additional simulation data for GD1.
Here, the true noise level (C) used in the test set and in the training set of ID3p differed from the
false noise parameters given to ID3ecp (C'). In the constant error sensitivity runs, the false noise
parameter C' was misstated by 0.10 (either all high or all low in a run) except near the end points
(C = 0 or 1). In varying error sensitivity runs, the mean of the false noise level C' was equal to
the true noise level C but variance in a range of ±0.10 or ±0.05 of the true C value was
Page 32
introduced. Somewhat surprisingly, the average performance of the algorithms in each sensitivity
run shows little difference than in Figure 4. In Figure 8, the CV of performance continues to
demonstrate a large advantage for ID3ecp for both constant error runs (high and low). In Figure
9, the CV advantage of ID3ecp has disappeared at high values of C in the ±0.10 case. In the ±0.05
case, however, there is still a marked advantage for ID3ecp. Figure 9 demonstrates that the CV of
performance of ID3ecp is sensitive to the accuracy of the noise level assessment, compared with an
implicit algorithm whose training set is drawn from the same noise process as the unseen cases. Thus,
the stable behavior of explicit algorithms may deteriorate due to the variance caused by incorrect
specification of the noise level. When the variance in incorrect specification is high, the noise
variance acting on implicit algorithms may be balanced by the incorrect specification variance
acting on explicit algorithms.
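The sensitivity runs differ only in how the misstated parameter C' fed to ID3ecp is generated. A minimal sketch follows; the uniform error model for the varying runs and the clamping near the end points are our own assumptions for illustration.

import random

def misstated_noise_level(C, mode, delta, rng):
    # Constant error runs misstate C by a fixed +delta or -delta; varying error
    # runs keep the mean of C' equal to C but add an error drawn from
    # [-delta, +delta] (uniform here, by assumption).
    if mode == "constant_high":
        c_prime = C + delta
    elif mode == "constant_low":
        c_prime = C - delta
    elif mode == "varying":
        c_prime = C + rng.uniform(-delta, delta)
    else:
        raise ValueError(mode)
    return min(1.0, max(0.0, c_prime))     # keep C' in [0, 1] near the end points

# Example: C' values given to ID3ecp while ID3p and the test set use the true C.
rng = random.Random(1)
print([round(misstated_noise_level(0.8, "varying", 0.10, rng), 3) for _ in range(5)])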
5. Summary and Conclusions
We compared decision tree induction algorithms under conditions of noisy input data
where the level of noise was either implicitly known through a sample of the noise process or
explicitly known through an external parameter. Explicit measurement of noise is cost effective
when training data is provided directly by experts or when the cost of directly estimating the level
of noise is less than the cost of sampling a representative noisy process. In addition, the appeal of
explicit noise measurement broadens when there is an associated performance benefit. Here, we
demonstrated that explicit noise measurement can be accompanied by more stable performance on
unseen cases.
Our primary contributions were to develop an explicit noise handling algorithm (ID3ecp)
and demonstrate its advantages as compared to a standard implicit noise algorithm (ID3 with the
pessimistic pruning procedure). ID3ecp injects random noise into partitions but keeps the class
and true state frequencies in a partition as close as possible to their estimated values. The aim of
the controlled scrambling procedure is to reduce the variance in the partitioning behavior. We
demonstrated that the implicit and explicit algorithms have the same expected behavior and
performance, but the explicit algorithm has more stable behavior and performance. The
behavioral results were demonstrated by an analytical comparison of a constant noise process as
the best case for a controlled partitioning process and a binomial noise process for implicit noise
measurement. The performance of the algorithms was demonstrated by simulation experiments to
compare the average performance and coefficient of variation of performance for the two
algorithms.
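The distinction between the two noise-handling styles can be made concrete with a simplified sketch: random injection corrupts each value independently, whereas controlled injection fixes the number of corrupted values at its expected value so that the realized frequencies stay close to their estimates. This is our own illustration of the idea of controlled scrambling, not the ID3ecp implementation.

import random

def random_injection(values, states, C, rng):
    # Implicit style: each value is independently kept with probability C, so the
    # number of corrupted cases varies binomially from run to run.
    return [v if rng.random() < C else rng.choice([s for s in states if s != v])
            for v in values]

def controlled_injection(values, states, C, rng):
    # Explicit style: exactly round((1 - C) * n) values are corrupted, holding the
    # amount of injected noise (and hence the realized state frequencies) near its
    # expected value and thereby reducing run-to-run variance.
    n = len(values)
    corrupted = set(rng.sample(range(n), round((1 - C) * n)))
    return [rng.choice([s for s in states if s != v]) if i in corrupted else v
            for i, v in enumerate(values)]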
This research is part of our long term interest concerning the economics of expert systems.
Two direct extensions of this work are treating the noise level as a decision variable rather than a
constraint and developing induction algorithms that combine mean and variance of performance.
In the former topic, explicit measurement of noise is required to trade off the cost of removing
noise against the benefit of improved decision making. Other topics not directly related
to this work are optimizing expert system performance over a multi-period horizon and
developing cost-benefit objectives for other approaches such as Bayesian reasoning networks.
Appendix A
Convexity of Gain with Respect to Noise Level
Proposition 1
$$
\frac{\partial g(X_k^o \mid Z^C)}{\partial C}\;
\begin{cases}
> 0, & C \in (1/q,\ 1] \quad (6.1) \\
= 0, & C = 1/q \quad (6.2) \\
< 0, & C \in [0,\ 1/q) \quad (6.3)
\end{cases}
$$
We prove (6.2) and demonstrate the truth of (6.1) and (6.3) for a special case. Let us
begin by showing the second condition, namely, the slope is zero at C = 1/q. With some algebra,
the first derivative of the gain function with respect to C can be derived as (A1):
$$
\frac{\partial g(X_k^o \mid Z^C)}{\partial C}
= \frac{1}{q-1}\sum_{r=1}^{m}\sum_{j=1}^{q}\,(q p_j p_{rj} - p_r)\,
\log_2\!\left[\frac{p_r + \alpha\,(q p_j p_{rj} - p_r)}{1 + \alpha\,(q p_j - 1)}\right]
\qquad (A1)
$$

where $p_j = P(X_k^t = x_{kj} \mid Z^C)$, $p_r = P(\psi = c_r \mid Z^C)$,
$p_{rj} = P(\psi = c_r \mid Z^C \wedge X_k^t = x_{kj})$, and $\alpha = 1 - Wq$.
For C = 1/q, α = 0. Substituting α = 0 in the above equation for the slope we obtain:
$$
\frac{\partial g(X_k^o \mid Z^C)}{\partial C}
= \frac{1}{q-1}\sum_{r=1}^{m}\log_2(p_r)\sum_{j=1}^{q}(q p_j p_{rj} - p_r) = 0,
\qquad\text{as}\quad
\sum_{j=1}^{q}(q p_j p_{rj} - p_r) = q p_r - q p_r = 0.
$$

The second condition in the proposition is proved. †
For (6.1) and (6.3), we show that the second derivative is greater than zero in a special
case. With some algebra, the second derivative of the gain with respect to W can be derived as
(A2). We must show that the second derivative is greater than 0 as stated in (A3).
$$
\frac{\partial^2 g(X_k^o \mid Z^C)}{\partial W^2}
= -\sum_{j=1}^{q}\sum_{r=1}^{m}
\frac{\delta_{rj}\,(p_r - q\,\delta_{rj})}{W p_r + (1 - Wq)\,\delta_{rj}}
+ \sum_{j=1}^{q}
\frac{\delta_j\,(1 - q\,\delta_j)}{W + (1 - Wq)\,\delta_j}
\qquad (A2)
$$

where $W = \dfrac{1 - C}{q - 1}$, $\delta_{rj} = q p_j p_{rj} - p_r$, and $\delta_j = q p_j - 1$.

$$
\sum_{j=1}^{q}\frac{\delta_j\,(1 - q\,\delta_j)}{W + (1 - Wq)\,\delta_j}
\;>\;
\sum_{j=1}^{q}\sum_{r=1}^{m}\frac{\delta_{rj}\,(p_r - q\,\delta_{rj})}{W p_r + (1 - Wq)\,\delta_{rj}}
\qquad (A3)
$$
Because $\sum_r \delta_{rj} = \delta_j$, we assume that one of the $\delta_{rj}$ terms equals
$\delta_j$ and that the other $m-1$ terms can be divided into pairs such that the sum of each pair
is zero and the members of each pair have the same absolute value. In addition, we assume that
$p_r$ ($= \mu$) is constant for all $r$. Using these assumptions and some algebra, the right-hand
side of (A3) can be reduced to a quantity less than (A4).
$$
\sum_{j=1}^{q}\frac{\delta_j\,(\mu - q\,\delta_j)}{W\mu + (1 - Wq)\,\delta_j}
\qquad (A4)
$$
Substituting (A4) into (A3) and using some further algebra, (A3) can be reduced to (A5).
$$
\sum_{j=1}^{q}\frac{\delta_j^2}{D}
\;>\;
\sum_{j=1}^{q}\frac{\delta_j^2\,\mu}{D},
\qquad\text{where } D = \bigl(W\mu + (1 - Wq)\,\delta_j\bigr)\bigl(W + (1 - Wq)\,\delta_j\bigr)
\qquad (A5)
$$

(A5) is true because $\delta_j^2 > 0$, $\mu < 1$, and $D > 0$. This proves (6.1) and (6.3) for the
special case stated above.
†
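Proposition 1 can also be checked numerically. The sketch below evaluates the gain of the observed attribute under the noise model (the correct state is retained with probability C, each other state is observed with probability W = (1 - C)/(q - 1)) and confirms that the gain is smallest near C = 1/q; the example distribution is arbitrary and our own.

import math

def gain_under_noise(C, p_j, p_rj):
    # p_j[j] = P(X^t = x_j);  p_rj[r][j] = P(class = c_r | X^t = x_j).
    q, m = len(p_j), len(p_rj)
    W = (1.0 - C) / (q - 1)

    def entropy(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    p_r = [sum(p_j[j] * p_rj[r][j] for j in range(q)) for r in range(m)]
    gain = entropy(p_r)
    for j in range(q):                                  # observed state x_j
        joint = [sum((C if jt == j else W) * p_j[jt] * p_rj[r][jt]
                     for jt in range(q)) for r in range(m)]
        marginal = sum(joint)
        gain -= marginal * entropy([pr / marginal for pr in joint])
    return gain

p_j = [0.5, 0.3, 0.2]                                   # q = 3 input states
p_rj = [[0.9, 0.2, 0.4], [0.1, 0.8, 0.6]]               # m = 2 classes
gains = {c / 100: gain_under_noise(c / 100, p_j, p_rj) for c in range(0, 101, 5)}
print(min(gains, key=gains.get))   # the minimizing grid point lies closest to 1/q = 1/3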
Appendix B
Pessimistic Pruning Procedure
Appendix B describes the pruning procedure used in the algorithms ID3p and ID3ecp. This
description has been adapted from [Quinlan 1987].
For any given tree T, generated using a training set of N cases, let some leaf in the tree
account for K of these cases with J of them misclassified. The ratio J/K does not provide a
reasonable estimate of the error rate when classifying unseen cases [Quinlan 1987]. A more
reasonable estimate is obtained using the continuity correction factor for the binomial distribution,
wherein J is replaced by J+0.5 [Snedecor and Cochran 1980].
Let S be a subtree of T with $L_S$ leaves, and let $J_S$ and $K_S$ be the corresponding sums of
errors and cases classified over S. Using the continuity correction factor, the expected number of
cases ($M_S$) misclassified by S out of $K_S$ unseen cases should be:

$$M_S = J_S + 0.5\,L_S$$
The standard error of $M_S$, $Se(M_S)$, is given by:

$$Se(M_S) = \sqrt{\frac{M_S\,(K_S - M_S)}{K_S}}$$
Let $E$ be the number of cases misclassified out of $K_S$ if the subtree S is replaced by its best
leaf. The pessimistic pruning procedure replaces S by its best leaf if:

$$E + 0.5 \le M_S + Se(M_S)$$
In pessimistic pruning, all non-leaf subtrees are examined only once and subtrees of
pruned subtrees need not be examined at all.
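A sketch of the pruning test described above follows; the function is our own rendering of the decision rule and assumes the subtree statistics have already been collected.

import math

def should_prune(j_s, k_s, l_s, e_best_leaf):
    # j_s: misclassified training cases summed over the leaves of subtree S
    # k_s: training cases covered by S;  l_s: number of leaves in S
    # e_best_leaf: cases the best leaf of S would misclassify out of the same k_s
    m_s = j_s + 0.5 * l_s                        # continuity-corrected error of S
    se = math.sqrt(m_s * (k_s - m_s) / k_s)      # standard error of m_s
    return e_best_leaf + 0.5 <= m_s + se         # if true, replace S by its best leaf

# Example: a 4-leaf subtree misclassifying 10 of 100 cases (m_s = 12, se ~ 3.25)
# is replaced by its best leaf whenever that leaf misclassifies at most 14 cases.
print(should_prune(j_s=10, k_s=100, l_s=4, e_best_leaf=14))   # True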
References
Bisson, H. "Evaluation of Learning Systems: An Artificial Data-Based Approach," in Proceedings
of the European Working Session on Machine Learning, Y. Kodratoff (ed.), Springer-Verlag, Berlin, F.R.G., 1991.
Breiman, L., Friedman, J., Olshen, R., and Stone, C. Classification and Regression Trees,
Wadsworth Publishing, Belmont, CA, 1984.
Biemer, P. and Stokes, L. "Approaches to the Modeling of Measurement Error," in Measurement
Errors in Surveys, Chapter 24, Biemer, P., Groves, R., Lyberg, L., Mathiowetz, N., and
Sudman, S. (eds.), John Wiley & Sons, New York, 1991, pp. 487-516.
Christie, A. "Induction of Decision Trees from Noisy Examples," AI Expert, May 1993, 16-21.
Cestnik, B. and Bratko, I. "On Estimating Probabilities in Tree Pruning," in Proceedings of the
European Working Session on Machine Learning, Porto, Portugal, Springer-Verlag,
March 1991, pp. 138-150.
Creecy, R., Masand, B., Smith, S., and Waltz, D. "Trading MIPS and Memory for Knowledge
Engineering," Communications of the ACM 35, 8 (August 1992), 48-64.
Fuller, W. "Regression Estimation in the Presence of Measurement Error," in Measurement
Errors in Surveys, Chapter 30, Biemer, P., Groves, R., Lyberg, L., Mathiowetz, N., and
Sudman, S. (eds.), John Wiley & Sons, New York, 1991, pp. 617-636.
Groves, R. "Measurement Error Across Disciplines," in Measurement Errors in Surveys, Chapter
1, Biemer, P., Groves, R., Lyberg, L., Mathiowetz, N., and Sudman, S. (eds.), John Wiley &
Sons, New York, 1991, pp. 1-28.
Hill, D. "Interviewer, Respondent, and Regional Office Effects," in Measurement Errors in
Surveys, Chapter 23, Biemer, P., Groves, R., Lyberg, L., Mathiowetz, N., and Sudman, S.
(eds.), John Wiley & Sons, New York, 1991, pp. 463-486.
Holsapple, C. and Whinston, A. Business Expert Systems, Irwin, Homewood, Illinois, 1987.
Irani, K., Cheng, J., Fayyad, U., and Qian, Z. "Applying Machine Learning to Semiconductor
Manufacturing," IEEE Expert 8, 1 (February 1993), 41-47.
Kononenko, I. and Bratko, I. "Information-Based Evaluation Criterion for Classifier's
Performance," Machine Learning, 6, 1991, 67-80.
Laird, P. Learning from Good and Bad Data, Kluwer Academic Publishers, Norwell, MA, 1988.
Liang, T. "A Composite Approach to Inducing Knowledge for Expert System Design,"
Management Science 38, 1 (1992), 1-17.
Mantaras, R. "A Distance-Based Attribute Selection Measure for Decision Tree Induction,"
Machine Learning 6, 1991, 81-92.
Mingers, J. "An Empirical Comparison of Pruning Methods for Decision Tree Induction," Machine
Learning 4, 2, 1989, 227-243.
Moulet, M. "Using Accuracy in Scientific Discovery," in Proceedings of the European Working
Session on Machine Learning, Porto, Portugal, Springer-Verlag, March 1991, pp. 118-136.
Murphy, P. and Aha, D. UCI Repository of Machine Learning Databases, University of
California, Irvine, Department of Information and Computer Science, 1991.
Mookerjee, V. and Dos Santos, B. "Inductive Expert System Design: Maximizing System Value,"
Information Systems Research (in press), 1993.
Niblett, T. and Bratko, I. "Learning Decision Rules in Noisy Domains," in Research and
Development in Expert Systems (Proceedings of the Sixth Technical Conference of the
BCS Specialist Group on Expert Systems), Brighton, U.K., 1986.
Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference,
Morgan Kaufmann Publishers, San Mateo, CA, 1988.
Quinlan, J. "Induction of Decision Trees," Machine Learning, Vol. 1, 1986, 81-106.
Quinlan, J. "The Effect of Noise on Concept Learning," in Machine Learning, Vol. 2, Eds. R.
Michalski, J. Carbonell, and T. Mitchell, Tioga Press, Palo Alto, CA, 1986, pp. 149-166.
Quinlan, J. "Simplifying Decision Trees," International Journal of Man Machine Studies, Vol.
27, 1987, 221-234.
Ross, S. Applied Probability Models, Holden-Day, San Francisco, CA, 1970.
Rao, J. and Thomas, R. "Chi-Squared Tests with Complex Survey Data Subject to
Misclassification Error," in Measurement Errors in Surveys, Chapter 31, Biemer, P.,
Groves, R., Lyberg, L., Mathiowetz, N., and Sudman, S. (eds.), John Wiley & Sons, New
York, 1991, pp. 637-664.
Shannon, C. and Weaver, W. The Mathematical Theory of Communication, University of Illinois
Press, 1949 (republished 1964).
Snedecor, G. and Cochran, W. Statistical Methods, 7th edition, Iowa State University Press,
1980.
Tam, K. and Kiang, M. "Predicting Bank Failures: A Neural Network Approach," Applied
Artificial Intelligence, 4, 1990, 265-282.
[Two panels: "Generated Data Set 1 - Average Performance (RIS)" (Avg RIS vs. C, 0 to 1) and "Generated Data Set 1 - Variance of Performance (RIS)" (CV RIS vs. C), each with curves for ID3p and ID3ecp.]
Figure 4: RIS Performance Graphs for Generated Data Set 1
[Two panels: "Generated Data Set 1 - Average Performance (Accuracy)" (Avg ACC vs. C) and "Generated Data Set 1 - Variance of Performance (Accuracy)" (CV ACC vs. C), each with curves for ID3p and ID3ecp.]
Figure 5: Accuracy Performance Graphs for Generated Data Set 1
[Two panels: "Lymphography Data Set - Average Performance (RIS)" (Avg RIS vs. C) and "Lymphography Data Set - Variance of Performance (RIS)" (CV RIS vs. C), each with curves for ID3p and ID3ecp.]
Figure 6: RIS Performance Graphs for the Lymphography Data Set
[Two panels: "Lymphography Data Set - Average Performance (Accuracy)" (Avg ACC vs. C) and "Lymphography Data Set - Variance of Performance (Accuracy)" (CV ACC vs. C), each with curves for ID3p and ID3ecp.]
Figure 7: Performance Graphs (Accuracy) for the Lymphography Data Set
[Two panels: "ID3ecp Sensitivity in Generated Data Set 1 - Constant Low (-0.1)" and "ID3ecp Sensitivity in Generated Data Set 1 - Constant High (+0.1)" (CV RIS vs. C, curves for ID3p and ID3ecp).]
Figure 8: Constant Sensitivity Graphs for Generated Data Set 1
[Two panels: "ID3ecp Sensitivity in Generated Data Set 1 - Varying [-0.1, +0.1]" and "ID3ecp Sensitivity in Generated Data Set 1 - Varying [-0.05, +0.05]" (CV RIS vs. C, curves for ID3p and ID3ecp).]
Figure 9: Varying Sensitivity Graphs for Generated Data Set 1