
A FRAMEWORK FOR GENERATING DATA TO SIMULATE CHANGING
ENVIRONMENTS
Anand Narasimhamurthy
School of Electronics and Computer Science,
University of Wales, Bangor,
LL57 1UT, United Kingdom
[email protected]
Ludmila I. Kuncheva
School of Electronics and Computer Science,
University of Wales, Bangor,
LL57 1UT, United Kingdom
[email protected]
ABSTRACT
A fundamental assumption often made in supervised classification is that the problem is static, i.e. the description
of the classes does not change with time. However, many
practical classification tasks involve changing environments. Thus designing and testing classifiers for changing
environments are of increasing interest and importance. A
number of benchmark data sets are available for static classification tasks. For example, the UCI machine learning
repository is extensively used by researchers to compare
algorithms across various domains. No such benchmark
datasets are available for changing environments. Also,
while generating data for static environments is relatively
straightforward, this is not so for changing environments.
The reason is that an infinite number of changes can be simulated, and it is difficult to define which ones are realistic and hence useful. In this paper we propose
a general framework for generating data to simulate
changing environments. The paper gives illustrations of
how the framework encompasses various types of changes
observed in real data and also how the two most popular
simulation models (STAGGER and moving hyperplane)
are represented within.
KEY WORDS
Machine Learning, Changing environments, Concept drift, Population drift, Artificial data, Simulated data.

1 Introduction

Online learning is increasingly occupying centre stage in data mining as more and more applications generate massive streams of data [5, 6, 29]. In these cases it is often impractical to apply conventional classification methodologies, since these are typically batch algorithms, often requiring more than one pass through the entire data. Many online versions of conventional classification methods have been developed [23–25].

A fundamental assumption in supervised classification, including many of the online classification methods, is that the problem is static, i.e. the description of the classes does not change with time. In many practical problems this is most likely to be false. An example is detecting and filtering out spam e-mail. The descriptions of the two classes “spam” and “non-spam” evolve with time. They are user-specific, and users’ preferences change with time. Also, spammers are active opponents who devise newer methods to evade detection, thereby altering the definition of class “spam” within the current feature space. A change in the environment is referred to as concept drift in machine learning. The changes could include minor fluctuations of the underlying probability distributions, steady trends, rapid substitution of one classification task with another, and so on [12, 33]. These categories are by no means mutually exclusive and various combinations are possible. Bespoke online classification models [5, 6, 32], as well as classifier ensembles [21, 29], have been proposed for changing environments [22].

In the light of the growing importance of classification in changing environments, it is striking how little attention has been paid to putting together a collection of benchmark data sets or formulating generic procedures for simulating data. As a result, researchers use their own simulated datasets or specific real data domains such as finance [8, 12, 32], text categorisation [17–19], spam filtering [4] and web searches [11]. In spite of some critiques [27], the creation of the UCI Machine Learning Repository [3] in 1987 has had a significant positive impact on the Machine Learning and Pattern Recognition communities [1]. A KDD (Knowledge Discovery in Databases) archive was subsequently developed in 1999 [1], containing large data sets of diverse types and domains (http://kdd.ics.uci.edu). Data with a variety of known changes are needed for evaluating classification methods in changing environments. In this study we propose a general framework for generating artificial data which simulate changing environments. The data may contain one or more types of drift, as desired by the user.

The rest of the paper is organised as follows. In Section 2 we review the terminology used in the literature on changing environments. The proposed framework is described in Section 3. Illustrations of how the proposed framework accommodates the defined types of environmental changes are given in Section 4. An implementation algorithm is given in Section 5. Section 6 concludes the study.
2 Terminology
The most used term in the ML literature on the subject is
‘concept drift’. However, there is a variety of other terms,
some of them not precisely defined, taken to mean the same
thing. This lack of uniformity of terminology prompted
us to try to summarise and explain the terms found in the
literature before proceeding with the general framework.
The term ‘concept’ is used to mean slightly different things. It is sometimes used to refer to the class of interest. A concept change in this case would mean a deviation of the description of the class from the original description. Consider for example the STAGGER dataset described by Widmer and Kubat [33] (this is discussed further in Section 4.2). In this dataset the feature space of a simple blocks world is described by three features (attributes), namely size, color and shape. Examples are labelled according to a target concept (which is not given to the classifier explicitly) and which changes over time. For example, the target concept for the first set of instances is “size = small AND color = red”, while the target concept for a subsequent set of instances is “color = green OR shape = circular”. An analogous ‘target concept’ or ‘concept of interest’ in a real-world domain such as information filtering may be the class of ‘relevant articles’ from a set of documents (e.g. Klinkenberg [14, 18]).
In other articles, the authors use ‘concept’ to refer to the whole distribution of the problem at a given time moment, including all classes. Then concept drift is the substitution of the concept at one time moment with another concept at a later moment. To avoid confusion, in this study we will try to avoid the term altogether. Another frequently used term is population drift, which refers to changes in the underlying distributions. Kelly et al. [12] use the term population drift to mean a change either in the priors or in the probabilities of class membership conditional on the feature vectors. They point out that the term concept drift has been used in the machine learning literature both for this (population drift) and for other changes.
While in a natural system we can expect gradual drifts
(e.g., seasonal, demographic, habitual, etc.), sometimes the
class description may change rapidly due to so-called hidden contexts. Such contexts may be, for example, illumination in image recognition and accents in speech recognition [34]. The context might instantly become highly relevant. Consider a system trained on images with similar
illumination. If images with the same type of content but
a different illumination are fed to the system, the class descriptions might change so as to make the system worse
than a random guess.
The types of changes can be roughly summarised as follows; they are briefly discussed in subsequent sections. These categories are by no means mutually exclusive and various combinations are possible. For example, recurring trends and population drift are both subsumed by the other categories, namely gradual changes and concept substitution. They are listed separately, however, since they are changes which occur in real-world applications and also because authors make distinctions between them.

• Gradual changes (gradual drift, evolutionary changes, concept drift) [2, 10, 13, 16, 30]
• Substitutions (abrupt changes, revolutionary changes, concept substitution, concept shift) [10, 16, 33]
• Recurring trends (recurring contexts) [31, 33]
• Population drift [12]
3 A general framework for simulating changing environments

Every classification problem, however complex it might be, may be described completely by the following. Let $\mathbf{x} \in \mathbb{R}^n$ be an object in the $n$-dimensional feature space of the problem and $\Omega = \{\omega_1, \ldots, \omega_c\}$ be the set of class labels. The knowledge of the prior probabilities for the classes, $P(\omega_i)$, and the class-conditional probability density functions (pdf) $p(\mathbf{x}|\omega_i)$, $i = 1, \ldots, c$, determines completely the problem and the optimal (minimum error) classifier for it. Viewed in this probabilistic sense, a classification problem may change due to changes in $P(\omega_i)$ and/or $p(\mathbf{x}|\omega_i)$. Posterior probabilities $P(\omega_i|\mathbf{x})$ and the unconditional pdf $p(\mathbf{x})$ may also change, but these changes can be re-expressed in terms of $P(\omega_i)$ and $p(\mathbf{x}|\omega_i)$.

Without loss of generality, assume that all data live in the $n$-dimensional real space, $\mathbb{R}^n$. Consider a set of $k$ data sources $S_1, \ldots, S_k$ with known distributions. The distribution for source $S_j$ is characterised by the class-conditional pdfs $p_j(\mathbf{x}|\omega_i)$ and the prior probabilities $P_j(\omega_i)$, $i = 1, \ldots, c$, for this source, $j = 1, \ldots, k$. At any time $t$ we have one or more “active” data sources. Let $a_j(t) \in [0, 1]$ specify the extent of the influence of data source $S_j$ at time $t$. We shall treat the influences as mixing proportions and the resultant distribution as a mixture, i.e.
$$\sum_{j=1}^{k} a_j(t) = 1 \quad \text{for any } t.$$
At time $t$ the data distribution $p^{(t)}(\mathbf{x})$ is characterised by prior probabilities
$$P^{(t)}(\omega_i) = \sum_{j=1}^{k} a_j(t)\, P_j(\omega_i)$$
and class-conditional pdfs
$$p^{(t)}(\mathbf{x}|\omega_i) = \sum_{j=1}^{k} a_j(t)\, p_j(\mathbf{x}|\omega_i).$$
As the distributions at the sources are fixed, the data distribution at moment $t$, $p^{(t)}(\mathbf{x})$, is specified through the $a_j(t)$. Thus we can equivalently define $p^{(t)}(\mathbf{x})$ as
$$p^{(t)}(\mathbf{x}) = \sum_{j=1}^{k} a_j(t) \sum_{i=1}^{c} P_j(\omega_i)\, p_j(\mathbf{x}|\omega_i).$$
To be able to cater for all possible scenarios of changes, we will allow the set of sources to be as large as necessary. For example, if 1000 time instances are needed in a population drift data set, the set of sources might need to be of cardinality 1000, one source for each time instance.
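To make the notation concrete, the following sketch (an illustration of ours, not code from the paper) evaluates the time-dependent priors and class-conditional densities of the mixture for two hypothetical one-dimensional Gaussian sources and a given vector of mixing proportions; the source parameters are assumptions chosen only for illustration.

```python
import numpy as np
from scipy.stats import norm

# Two hypothetical 1-D data sources, each with two classes (assumed values).
# Source j is given by priors P_j(w_i) and class-conditional pdfs p_j(x|w_i).
priors = [np.array([0.5, 0.5]),           # P_1(w_1), P_1(w_2)
          np.array([0.3, 0.7])]           # P_2(w_1), P_2(w_2)
cond_pdfs = [                             # p_j(x|w_i) as frozen distributions
    [norm(-1.0, 0.5), norm(1.0, 0.5)],    # source 1
    [norm(-0.5, 0.5), norm(1.5, 0.5)],    # source 2
]

def mixture_at(a_t, x):
    """Priors P^(t)(w_i) and class-conditional pdf values p^(t)(x|w_i)
    of the mixture, given mixing proportions a_t = (a_1(t), ..., a_k(t))."""
    P_t = sum(a * P for a, P in zip(a_t, priors))
    p_t = [sum(a * pdfs[i].pdf(x) for a, pdfs in zip(a_t, cond_pdfs))
           for i in range(2)]
    return P_t, p_t

print(mixture_at([0.67, 0.33], x=0.0))
```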
Figure 1. Mixing proportions $a_1(t)$ and $a_2(t)$ for a linear gradual change between two data sources.

4 Illustration of the types of changes within the proposed framework

4.1 Gradual drift

Two examples of gradual drift are given below. The first example uses only two sources, with a gradual change from one source to the other, as proposed by Widmer and Kubat [33] and Black and Hickey [2]. This is a scenario where one concept fades gradually while the other takes over. The real-world example given by Widmer and Kubat is that of a device which begins to malfunction. At first only a small number of data points will come from the stable failure state. Finally the failure will take over completely. Figure 1 shows $a_1(t)$ and $a_2(t)$ as functions of $t$ for this case.

Figure 2 (a) represents the starting data, where the distribution is equivalent to that of source 1, i.e. $(a_1, a_2) = (1, 0)$. The end distribution is that of source 2, shown in Figure 2 (d). Subplots (b) and (c) plot the transition of the data distribution for $(a_1, a_2)$ equal to (0.67, 0.33) and (0.33, 0.67), respectively.
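A minimal sketch of this first scenario follows. It is our own illustration: the class-conditional densities are assumed Gaussian rather than the exact distributions behind Figure 2, and only the linear fading of source 1 into source 2 is taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 400  # number of time instances / data points

# Two hypothetical sources; each returns one (x, label) pair per call.
def sample_source_1():
    label = int(rng.integers(2))                   # equal priors (assumed)
    mean = (-0.5, -0.5) if label == 0 else (0.5, 0.5)
    return rng.normal(mean, 0.3), label

def sample_source_2():
    label = int(rng.integers(2))
    mean = (-0.5, 0.5) if label == 0 else (0.5, -0.5)
    return rng.normal(mean, 0.3), label

data = []
for t in range(N):
    a1 = 1.0 - t / (N - 1)                         # a_1(t) falls linearly from 1 to 0
    source = sample_source_1 if rng.random() < a1 else sample_source_2
    data.append(source())
```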
The second example of gradual change is when there are more sources, so that the mixture components at any time of the transition are not just the start and the end ones. This example mimics population drift, whereby the parameters of the distributions change gradually with time. An example from real life would be the change of the category “high risk” in loan approval. The criteria for a money lender to approve a loan evolve with time in response to new socio-economic circumstances: gross annual salaries increase; the market is getting saturated with loan providers. All this will alter the class distributions, so class “high risk” from a few years ago will have a new profile now. To model gradual changes in the parameters of the distributions, we assume that there are as many data sources as there are time instances to be recorded in the transition. At time instance $t$, $a(t)$ will be a set of binary mixing proportions, with $a_j(t) = 1$ for the data source corresponding to $t$, and $a_i(t) = 0$ for all other sources. To make the model plausible, the consecutive data sources should be closely related. For example, the data source for time $t + 1$ can be obtained from the data source for time $t$ by tweaking a parameter. In this example, the starting and the final distributions are the same as in the first example. However, at each time $t$ only one source is sampled, so the two classes in the example remain separable at any transition time moment $t$. Figure 3 shows the scatterplot of the data. Note that this time there are four data sources with their corresponding $a_j(t)$.

This second example of gradual change illustrates a popular artificial data type used by many authors to test their classification methods for changing environments. This data simulation idea is often called the moving hyperplane [7, 11, 21].
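The sketch below illustrates this second scenario in the spirit of the moving hyperplane. The parametrisation (a linear class boundary whose offset drifts with time) is our own assumption, not the exact sources used for Figure 3; what it preserves is the key property that each time instance has its own source, obtained from the previous one by tweaking a single parameter, and that only that source is sampled at time t.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 400  # one data source per time instance

def sample_source(t):
    """Source for time t: x uniform in [-1, 1]^2, labelled by the boundary
    x1 + x2 = b(t), where the offset b(t) drifts linearly with t."""
    b = -1.0 + 2.0 * t / (N - 1)          # the tweaked parameter
    x = rng.uniform(-1.0, 1.0, size=2)
    label = int(x[0] + x[1] > b)
    return x, label

# Binary mixing proportions: at time t only source t is active.
stream = [sample_source(t) for t in range(N)]
```

Because the labels are always consistent with the current boundary, the two classes remain separable at every time instance, as in the example.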
The proposed framework is meant to be a versatile tool to combine the scenarios of the two examples, by letting many data sources be sampled at a time, with mixing proportions evolving with time. It is hoped that in this way the generated data will represent a more realistic model of real-life changing environments.

Figure 2. Scatterplots of the data illustrating a gradual (linear) change from the distribution of source 1 (a) to that of source 2 (d). Class $\omega_1$ is depicted with grey dots, and class $\omega_2$ with black dots.

Figure 3. Scatterplots of the data illustrating a gradual (linear) change using 4 data sources. Class $\omega_1$ is depicted with grey dots, and class $\omega_2$ with black dots.
4.2 Substitutions

This is the type of change most often simulated in order to assess the performance of a classifier in a changing environment. The classical example is the STAGGER concept, first proposed by Schlimmer and Granger [28]. It has received the status of a benchmark artificial data set by being used by many authors since then [4, 7, 20–22, 33].

The feature space is described by three features (attributes): size $\in$ {small (1), medium (2), large (3)}, color $\in$ {red (a), green (b), blue (c)} and shape $\in$ {square (A), circular (B), triangular (C)}. There are three data sources (called ‘target concepts’):

Target Concept 1: size = small AND color = red
Target Concept 2: color = green OR shape = circular
Target Concept 3: size = medium OR size = large

In their experiment, Widmer and Kubat [33] generate randomly (from a uniform distribution across the feature space) 120 training instances and label each instance according to the current target concept. After processing each instance, the predictive accuracy is tested on an independent test set of 100 instances, also generated randomly. One target concept is active at a time. Concept 1 is active from instance 1 to instance 40, Concept 2 is active from instance 41 to 80, and Concept 3 from instance 81 to 120.

This example is modelled as follows. Consider three data sources, each related to one of the target concepts. The distributions at the three data sources are:

Data source 1. $P_1(\omega_1) = 1/9$, $P_1(\omega_2) = 8/9$; $p_1(\mathbf{x}|\omega_1) = 1/3$ for instances 1aA, 1aB and 1aC, and 0 for all other instances; $p_1(\mathbf{x}|\omega_2) = 0$ for instances 1aA, 1aB and 1aC, and 1/24 for all other instances.

Data source 2. $P_2(\omega_1) = 5/9$, $P_2(\omega_2) = 4/9$; $p_2(\mathbf{x}|\omega_1) = 1/15$ for instances xby and xzB, where x is any of 1, 2, 3, y is any of A, B, C, and z is any of a, b, c, and 0 for all other instances; $p_2(\mathbf{x}|\omega_2) = 0$ for instances xby and xzB, and 1/12 for all remaining instances.

Data source 3. $P_3(\omega_1) = 2/3$, $P_3(\omega_2) = 1/3$; $p_3(\mathbf{x}|\omega_1) = 1/18$ for instances 2zy and 3zy, and 0 for all remaining instances; $p_3(\mathbf{x}|\omega_2) = 0$ for instances 2zy and 3zy, and 1/9 for all remaining instances.

In the substitution scenario, usually one data source is instantly replaced by another, so $a(t)$ again contains binary mixing proportions, as in the second example above. However, this time the subsequent data sources are assumed to be significantly different from one another.
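A short sketch of the substitution scenario for the STAGGER data follows, using the schedule above (one concept active per block of 40 instances). It is an illustration of ours; the string encoding of the attribute values is an assumption made only for readability.

```python
import itertools
import random

random.seed(0)
SIZES = ("small", "medium", "large")
COLORS = ("red", "green", "blue")
SHAPES = ("square", "circular", "triangular")
INSTANCES = list(itertools.product(SIZES, COLORS, SHAPES))  # the 27 possible objects

CONCEPTS = [
    lambda s, c, sh: s == "small" and c == "red",        # Target Concept 1
    lambda s, c, sh: c == "green" or sh == "circular",   # Target Concept 2
    lambda s, c, sh: s in ("medium", "large"),           # Target Concept 3
]

def stagger_stream(n=120, block=40):
    """One active concept at a time: concept 1 for instances 1-40,
    concept 2 for instances 41-80, concept 3 for instances 81-120."""
    for t in range(n):
        x = random.choice(INSTANCES)          # uniform over the feature space
        concept = CONCEPTS[min(t // block, len(CONCEPTS) - 1)]
        yield x, int(concept(*x))

data = list(stagger_stream())
```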
4.3 Recurring trends

In many applications contexts tend to recur, either cyclically or in an unordered fashion. Biological and economic systems tend to go through cycles of development with recurring patterns [33]. Recurring contexts may be due to irregular phenomena such as inflation rates or market mood [9], or to cyclic phenomena such as seasonal changes. For example, Tsymbal et al. [31] deal with antibiotic resistance data for nosocomial infections. They discovered a seasonal context recurring with winter and spring models; these corresponded to yearly winter and spring infection outbreaks.

With respect to our framework, the phenomenon of a periodically recurring context can be expressed as
$$p^{(t)}(\mathbf{x}) = p^{(t+T)}(\mathbf{x}),$$
where $T$ is the period of repetition. For the STAGGER data, $a_1(t) = 1$ for $t \bmod T \in [0, T/3)$ and 0 for any other $t$; $a_2(t) = 1$ for $t \bmod T \in [T/3, 2T/3)$ and 0 for any other $t$; $a_3(t) = 1$ for $t \bmod T \in [2T/3, T)$ and 0 for any other $t$.
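In code, a recurring context simply means mixing proportions that cycle with period T. A small sketch (ours, assuming the three STAGGER sources and equal time shares per source):

```python
def recurring_proportions(t, T, k=3):
    """Binary mixing proportions a_1(t), ..., a_k(t) that cycle through
    k sources with period T (each source active for T/k time instances)."""
    active = int((t % T) // (T / k))
    return [1.0 if j == active else 0.0 for j in range(k)]

# With T = 120, the active source repeats every 120 instances:
assert recurring_proportions(10, 120) == [1.0, 0.0, 0.0]
assert recurring_proportions(50, 120) == [0.0, 1.0, 0.0]
assert recurring_proportions(130, 120) == [1.0, 0.0, 0.0]
```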
5 Implementation algorithm

Figure 4 gives an algorithm which implements the proposed framework. (MATLAB code is available at http://www.informatics.bangor.ac.uk/~kuncheva/activities/epsrc2006/changing_environments.html.)

Input:
  $N$ – the number of time instances required ($N$ = number of data points);
  $k$ – the number of data sources;
  a user-defined function which samples once from data source $S_j$, $j = 1, \ldots, k$;
  a $k \times N$ array with the mixing proportions (column $t$ contains $a_1(t), \ldots, a_k(t)$).

Data generation:
  Initialise the data set $Z = \emptyset$.
  For $t = 1, \ldots, N$:
    (1) Select a source index $j$ by sampling once from $\{1, \ldots, k\}$ with the distribution defined by $a_1(t), \ldots, a_k(t)$.
    (2) Sample a new data point from source $S_j$ and add it to the current set $Z$ (the class label is also added).
  End
  Return the data set $Z$.

Figure 4. Algorithm: Generating data to simulate changing environments.
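The algorithm translates directly into a few lines of code. The sketch below is ours (the authors' MATLAB implementation is at the URL given above); the names `sources`, `A` and `generate` are assumptions, chosen to match the notation of Figure 4.

```python
import numpy as np

def generate(sources, A, rng=None):
    """Generate a data stream of N points from k data sources.

    sources -- list of k callables; sources[j]() draws one (x, label) pair
               from data source S_j
    A       -- k x N array of mixing proportions; column t holds
               a_1(t), ..., a_k(t) and sums to 1
    """
    rng = rng or np.random.default_rng()
    k, N = A.shape
    data = []
    for t in range(N):
        j = rng.choice(k, p=A[:, t])   # pick a source with the current proportions
        data.append(sources[j]())      # sample one labelled point from source j
    return data
```

With `sources` built as in Section 4 and the columns of `A` chosen as in Sections 4.1–4.3, the same routine produces gradual, substitution or recurring-trend data.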
6 Conclusion

We propose a general framework to generate data simulating changing environments. The framework is meant to channel the effort in developing a test-bed of artificial data to train and test classifiers under various modes of changing environments. The framework accommodates the currently favourite STAGGER and moving hyperplane generation strategies, and offers a versatile approach towards devising new strategies. We show that the framework covers the three main types of changes formulated in the literature: gradual changes, substitutions and systematic (recurrent) trends. These can all be simulated individually or in combination. An implementation algorithm is detailed as well.

Our future work includes simulating changes on real data. The current state of the art is relatively poor in this respect, revolving around dropping classes and adding data from new classes [15, 26] or tweaking the class labels [2]. One possible way to go is to use one of the features as a hidden context and sort the data set by that feature to simulate a trend.

For now we have looked into generating artificial data, but the main focus must be maintained on the possibility of finding real data sets which exhibit a clear case of changing environments. Many such datasets, e.g. in finance, are not freely available as benchmarks due to privacy limitations. Given the pressing need for new tools and methods for problems with changing environments, it will not be long before a valuable benchmark of such data becomes publicly available, following in the footsteps of the UCI ML repository [3] and the UCI KDD archive [1].
Acknowledgments

This work was supported by EPSRC grant EP/D04040X/1.

References

[1] S. D. Bay, D. Kibler, M. J. Pazzani, and P. Smyth. The UCI KDD archive of large data sets for data mining research and experimentation. SIGKDD Explorations, 2(2):14–18, 2000.

[2] M. M. Black and R. J. Hickey. Maintaining the performance of a learned classifier under concept drift. Intelligent Data Analysis, 3(6):453–474, 1999.

[3] C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.

[4] S. J. Delany, P. Cunningham, and A. Tsymbal. A comparison of ensemble and case-based maintenance techniques for handling concept drift in spam filtering. Technical Report TCD-CS-2005-19, Trinity College Dublin, 2005.

[5] P. Domingos and G. Hulten. Mining high-speed data streams. In Knowledge Discovery and Data Mining, pages 71–80, 2000.

[6] P. Domingos and G. Hulten. A general framework for mining massive data streams. Journal of Computational and Graphical Statistics, 12:945–949, 2003.

[7] F. Ferrer-Troyano, J. S. Aguilar-Ruiz, and J. C. Riquelme. Incremental rule learning based on example nearness from numerical data streams. In 2005 ACM Symposium on Applied Computing (SAC 05), Santa Fe, New Mexico, USA, 2005. ACM Press.

[8] M. Harries and K. Horn. Detecting concept drift in financial time series prediction using symbolic machine learning. In Eighth Australian Joint Conference on Artificial Intelligence, pages 91–98, Singapore, 1995. World Scientific Publishing.

[9] M. Harries, C. Sammut, and K. Horn. Extracting hidden context. Machine Learning, 32:101–126, 1998.

[10] R. J. Hickey and M. M. Black. Refined time stamps for concept drift detection during mining for classification rules. In Proceedings of the International Workshop on Temporal, Spatial and Spatio-Temporal Data Mining (TSDM2000), volume 2007 of Lecture Notes in Artificial Intelligence, pages 20–30. Springer-Verlag, Berlin, 2000.

[11] G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 97–106. ACM Press, 2001.

[12] M. G. Kelly, D. J. Hand, and N. M. Adams. The impact of changing populations on classifier performance. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '99), pages 367–371, New York, NY, USA, 1999. ACM Press.

[13] M. Klenner and U. Hahn. Concept versioning: A methodology for tracking evolutionary concept drift in dynamic concept systems. In 11th European Conference on Artificial Intelligence, pages 473–477, Chichester, England, 1994. Wiley.

[14] R. Klinkenberg. Learning drifting concepts with partial user feedback. In P. Perner and V. Fink, editors, Beiträge zum Treffen der GI-Fachgruppe 1.1.3 Maschinelles Lernen (FGML-99), Magdeburg, Germany, 1999.

[15] R. Klinkenberg. Using labeled and unlabeled data to learn drifting concepts. In Workshop Notes of the IJCAI-01 Workshop on Learning from Temporal and Spatial Data, pages 16–24, Menlo Park, CA, USA, 2001. AAAI Press.

[16] R. Klinkenberg. Learning drifting concepts: Example selection vs. example weighting. Intelligent Data Analysis, 8(3), 2004.

[17] R. Klinkenberg and T. Joachims. Detecting concept drift with support vector machines. In Proceedings of the 17th International Conference on Machine Learning (ICML), pages 487–494, San Francisco, CA, USA, 2000. Morgan Kaufmann.

[18] R. Klinkenberg and I. Renz. Adaptive information filtering: Learning in the presence of concept drifts. In AAAI-98/ICML-98 Workshop on Learning for Text Categorization, Menlo Park, CA, 1998.

[19] R. Klinkenberg and S. Rüping. Concept drift and the importance of examples. In Text Mining – Theoretical Aspects and Applications, pages 55–77. Physica-Verlag, Berlin, Germany, 2003.

[20] J. Kolter. Using additive expert ensembles to cope with concept drift. Senior thesis, April 2005.

[21] J. Z. Kolter and M. A. Maloof. Dynamic weighted majority: A new ensemble method for tracking concept drift. In Proceedings of the 3rd IEEE International Conference on Data Mining, pages 123–130, Los Alamitos, CA, 2003. IEEE Press.

[22] L. I. Kuncheva. Classifier ensembles for changing environments. In 5th International Workshop on Multiple Classifier Systems (MCS 2004), volume 3077 of LNCS, pages 1–15. Springer-Verlag, 2004.

[23] H. K. H. Lee and M. A. Clyde. Lossless online Bayesian bagging. Journal of Machine Learning Research, 5:143–151, 2004.

[24] N. C. Oza. Online Ensemble Learning. PhD thesis, University of California, Berkeley, 2001.

[25] N. C. Oza and S. Russell. Online bagging and boosting. In Eighth International Workshop on Artificial Intelligence and Statistics, Key West, Florida, 2001.

[26] R. Polikar, L. Udpa, S. S. Udpa, and V. Honavar. Learn++: An incremental learning algorithm for supervised neural networks. IEEE Transactions on Systems, Man and Cybernetics, 31(4):497–508, 2001.

[27] S. Salzberg. On comparing classifiers: A critique of current research and methods. Data Mining and Knowledge Discovery, 1:1–12, 1999.

[28] J. C. Schlimmer and R. H. Granger. Incremental learning from noisy data. Machine Learning, 1(3):317–354, 1986.

[29] W. N. Street and Y. S. Kim. A streaming ensemble algorithm (SEA) for large-scale classification. In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 377–382. ACM Press, 2001.

[30] A. Tsymbal. The problem of concept drift: Definitions and related work. Technical Report TCD-CS-2004-15, Trinity College Dublin, April 2004.

[31] A. Tsymbal, M. Pechenizkiy, P. Cunningham, and S. Puuronen. Handling local concept drift with dynamic integration of classifiers: Domain of antibiotic resistance in nosocomial infections. In Proceedings of the 19th IEEE International Symposium on Computer-Based Medical Systems (CBMS 2006), Salt Lake City, UT, June 2006. IEEE CS Press.

[32] H. Wang, W. Fan, P. S. Yu, and J. Han. Mining concept drifting data streams using ensemble classifiers. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 226–235. ACM Press, 2003.

[33] G. Widmer and M. Kubat. Learning in the presence of concept drift and hidden contexts. Machine Learning, 23:69–101, 1996.

[34] G. Widmer and M. Kubat, editors. Special issue on context sensitivity and concept drift. Machine Learning, 32, 1998.